
Chapter 9: Naive Bayes

Naive Bayes is a probabilistic classification algorithm based on Bayes' theorem, valued for its simplicity and efficiency. Despite its "naive" independence assumption, it performs well in many practical applications, particularly text classification and spam filtering.

9.1 What is Naive Bayes?

Naive Bayes is based on Bayes' theorem and assumes that features are independent of each other given the class (this is where the "naive" in the name comes from). Although this assumption often doesn't hold in reality, Naive Bayes still performs well in many scenarios.

9.1.1 Bayes' Theorem

Bayes' theorem describes the probability of event A occurring given that event B has been observed:

P(A|B) = P(B|A) × P(A) / P(B)

In classification problems:

P(Class|Features) = P(Features|Class) × P(Class) / P(Features)

9.1.2 The Naive Assumption

Naive Bayes assumes that all features are conditionally independent given the class:

P(x₁, x₂, ..., xₙ|y) = P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)
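
Combining this with Bayes' theorem, and noting that P(x₁, x₂, ..., xₙ) is the same for every class, prediction reduces to choosing the class that maximizes the product of the prior and the per-feature likelihoods:

predicted class = argmax over y of P(y) × P(x₁|y) × P(x₂|y) × ... × P(xₙ|y)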

9.1.3 Advantages of Naive Bayes

  • Fast training: Only requires estimating per-class probability distributions
  • Fast prediction: Classification reduces to a few simple probability calculations
  • Memory efficient: Only the probability parameters need to be stored
  • Handles multi-class problems: Naturally supports more than two classes
  • Friendly to small datasets: Does not require large amounts of training data
  • Provides probability output: Gives a confidence score for each prediction

9.1.4 Disadvantages of Naive Bayes

  • Independence assumption: Features are often correlated in reality, which violates the assumption
  • Zero-frequency problem: Feature values never seen with a class get zero probability unless smoothing is applied
  • Handling continuous features: Requires assuming a distribution type (e.g., Gaussian)

9.2 Setting Up Environment and Data

python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, load_iris, load_wine, fetch_20newsgroups
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, ComplementNB
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    roc_curve, auc, precision_recall_curve
)
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# Set random seed
np.random.seed(42)

# Set plot style
plt.style.use('seaborn-v0_8')
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

9.3 Gaussian Naive Bayes

9.3.1 Basic Principles

Gaussian Naive Bayes assumes that each feature follows a normal distribution given the class.
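
Concretely, for class y and feature i, the class-conditional likelihood is the normal density

P(xᵢ|y) = 1/√(2πσ²_yi) × exp(−(xᵢ − μ_yi)² / (2σ²_yi))

where the mean μ_yi and variance σ²_yi are estimated from the training samples of class y.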

python
# Demonstrate basic principles of Gaussian Naive Bayes
def demonstrate_gaussian_nb_principle():
    """Demonstrate basic principles of Gaussian Naive Bayes"""

    # Create simple binary classification data
    np.random.seed(42)

    # Class 0: mean [2, 2], std [1, 1]
    class0_x1 = np.random.normal(2, 1, 100)
    class0_x2 = np.random.normal(2, 1, 100)

    # Class 1: mean [-2, -2], std [1, 1]
    class1_x1 = np.random.normal(-2, 1, 100)
    class1_x2 = np.random.normal(-2, 1, 100)

    X = np.vstack([np.column_stack([class0_x1, class0_x2]),
                   np.column_stack([class1_x1, class1_x2])])
    y = np.hstack([np.zeros(100), np.ones(100)])

    # Visualize data and distributions
    fig, axes = plt.subplots(1, 3, figsize=(18, 6))

    # Original data
    colors = ['red', 'blue']
    for i, color in enumerate(colors):
        mask = y == i
        axes[0].scatter(X[mask, 0], X[mask, 1], c=color, alpha=0.6, label=f'Class {i}')

    axes[0].set_xlabel('Feature 1')
    axes[0].set_ylabel('Feature 2')
    axes[0].set_title('Original Data Distribution')
    axes[0].legend()
    axes[0].grid(True, alpha=0.3)

    # Distribution of feature 1
    for i, color in enumerate(colors):
        mask = y == i
        axes[1].hist(X[mask, 0], bins=20, alpha=0.6, color=color, label=f'Class {i}')

    axes[1].set_xlabel('Feature 1')
    axes[1].set_ylabel('Frequency')
    axes[1].set_title('Distribution of Feature 1')
    axes[1].legend()
    axes[1].grid(True, alpha=0.3)

    # Distribution of feature 2
    for i, color in enumerate(colors):
        mask = y == i
        axes[2].hist(X[mask, 1], bins=20, alpha=0.6, color=color, label=f'Class {i}')

    axes[2].set_xlabel('Feature 2')
    axes[2].set_ylabel('Frequency')
    axes[2].set_title('Distribution of Feature 2')
    axes[2].legend()
    axes[2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    return X, y

X_demo, y_demo = demonstrate_gaussian_nb_principle()

9.3.2 Training Gaussian Naive Bayes

python
# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# Create Gaussian Naive Bayes classifier
gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Predict
y_pred = gnb.predict(X_test)
y_pred_proba = gnb.predict_proba(X_test)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f"Gaussian Naive Bayes accuracy: {accuracy:.4f}")

print("\nDetailed classification report:")
print(classification_report(y_test, y_pred))

# View learned parameters
print(f"\nModel parameters:")
print(f"Class prior probabilities: {gnb.class_prior_}")
print(f"Feature means:")
for i, class_mean in enumerate(gnb.theta_):
    print(f"  Class {i}: {class_mean}")
print(f"Feature variances:")
for i, class_var in enumerate(gnb.sigma_):
    print(f"  Class {i}: {class_var}")

9.3.3 Decision Boundary Visualization

python
def plot_nb_decision_boundary(X, y, model, title="Naive Bayes Decision Boundary"):
    """Plot decision boundary for Naive Bayes"""
    plt.figure(figsize=(10, 8))

    # Create grid
    h = 0.02
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))

    # Predict grid points
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Predict probabilities
    Z_proba = model.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    Z_proba = Z_proba.reshape(xx.shape)

    # Plot decision boundary and probability contours
    plt.contourf(xx, yy, Z_proba, levels=50, alpha=0.8, cmap='RdYlBu')
    plt.colorbar(label='P(Class=1)')

    # Plot decision boundary
    plt.contour(xx, yy, Z_proba, levels=[0.5], colors='black', linestyles='--', linewidths=2)

    # Plot data points
    colors = ['red', 'blue']
    for i, color in enumerate(colors):
        mask = y == i
        plt.scatter(X[mask, 0], X[mask, 1],
                   c=color, label=f'Class {i}', alpha=0.7, edgecolors='black')

    plt.xlabel('Feature 1')
    plt.ylabel('Feature 2')
    plt.title(title)
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Plot decision boundary
plot_nb_decision_boundary(X_train, y_train, gnb, "Gaussian Naive Bayes Decision Boundary")

9.3.4 Comparison with Other Algorithms

python
# Compare Naive Bayes with other algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Use iris dataset for comparison
iris = load_iris()
X_iris, y_iris = iris.data, iris.target

X_train_iris, X_test_iris, y_train_iris, y_test_iris = train_test_split(
    X_iris, y_iris, test_size=0.2, random_state=42, stratify=y_iris
)

# Define algorithms
algorithms = {
    'Gaussian Naive Bayes': GaussianNB(),
    'Logistic Regression': LogisticRegression(random_state=42, max_iter=1000),
    'SVM': SVC(random_state=42, probability=True),
    'Decision Tree': DecisionTreeClassifier(random_state=42)
}

results = {}

print("Algorithm performance comparison (Iris dataset):")
print("Algorithm\t\t\tAccuracy\t\tCross-validation Score")
print("-" * 50)

for name, algorithm in algorithms.items():
    # Train and predict
    algorithm.fit(X_train_iris, y_train_iris)
    y_pred_iris = algorithm.predict(X_test_iris)

    # Performance metrics
    accuracy_iris = accuracy_score(y_test_iris, y_pred_iris)
    cv_scores = cross_val_score(algorithm, X_iris, y_iris, cv=5)
    cv_mean = np.mean(cv_scores)

    results[name] = {'accuracy': accuracy_iris, 'cv_score': cv_mean}
    print(f"{name}\t{accuracy_iris:.4f}\t\t{cv_mean:.4f}")

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 6))

names = list(results.keys())
accuracies = [results[name]['accuracy'] for name in names]
cv_scores = [results[name]['cv_score'] for name in names]

# Test accuracy
axes[0].bar(names, accuracies, color='skyblue', alpha=0.7)
axes[0].set_title('Test Set Accuracy Comparison')
axes[0].set_ylabel('Accuracy')
axes[0].tick_params(axis='x', rotation=45)
axes[0].set_ylim(0.8, 1.0)

# Cross-validation scores
axes[1].bar(names, cv_scores, color='lightgreen', alpha=0.7)
axes[1].set_title('Cross-validation Score Comparison')
axes[1].set_ylabel('CV Score')
axes[1].tick_params(axis='x', rotation=45)
axes[1].set_ylim(0.8, 1.0)

plt.tight_layout()
plt.show()

9.4 Multinomial Naive Bayes

9.4.1 Text Classification Application

Multinomial Naive Bayes is particularly suitable for handling discrete features, such as word frequencies in text data.
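
For a document represented by word counts x₁, x₂, ..., xₙ, each class y is scored as

P(y|x) ∝ P(y) × P(w₁|y)^x₁ × P(w₂|y)^x₂ × ... × P(wₙ|y)^xₙ

where the per-word probabilities are estimated with Laplace (additive) smoothing:

P(wᵢ|y) = (count of word wᵢ in class y + α) / (total word count in class y + α × n)

The smoothing constant α keeps a word that never appeared with a class during training from driving the whole product to zero.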

python
# Create text classification data
texts = [
    # Technology
    "人工智能技术发展迅速,机器学习算法不断改进",
    "深度学习在图像识别领域取得重大突破",
    "云计算和大数据技术推动数字化转型",
    "区块链技术在金融领域应用广泛",
    "物联网设备连接数量快速增长",
    "5G网络建设加速推进",
    "自动驾驶汽车技术日趋成熟",
    "量子计算研究取得新进展",

    # Sports
    "足球比赛精彩激烈,球员表现出色",
    "篮球联赛进入季后赛阶段",
    "游泳运动员打破世界纪录",
    "网球公开赛决赛即将开始",
    "马拉松比赛吸引众多跑者参与",
    "体操运动员展现完美技巧",
    "羽毛球世锦赛激战正酣",
    "滑雪运动在冬季备受欢迎",

    # Food
    "川菜以麻辣著称,口味独特",
    "粤菜注重原汁原味,制作精细",
    "意大利面条搭配各种酱料",
    "日式料理追求食材新鲜",
    "法式甜点制作工艺复杂",
    "烧烤美食深受大众喜爱",
    "海鲜料理营养丰富美味",
    "素食餐厅越来越受欢迎"
]

labels = [0]*8 + [1]*8 + [2]*8  # 0-Technology, 1-Sports, 2-Food
label_names = ['Technology', 'Sports', 'Food']

print(f"Text dataset information:")
print(f"Total texts: {len(texts)}")
print(f"Class distribution: {np.bincount(labels)}")

# Text vectorization
# Note: CountVectorizer's default tokenizer does not segment Chinese text, so each run of
# characters between punctuation marks becomes one token; in practice, pre-segment with a
# tool such as jieba before vectorizing.
vectorizer = CountVectorizer(max_features=100)
X_text = vectorizer.fit_transform(texts)

print(f"Feature dimensions: {X_text.shape}")
print(f"Feature vocabulary: {len(vectorizer.get_feature_names_out())}")

# Split data
X_train_text, X_test_text, y_train_text, y_test_text = train_test_split(
    X_text, labels, test_size=0.3, random_state=42, stratify=labels
)

# Train Multinomial Naive Bayes
mnb = MultinomialNB(alpha=1.0)  # alpha is Laplace smoothing parameter
mnb.fit(X_train_text, y_train_text)

# Predict
y_pred_text = mnb.predict(X_test_text)
y_pred_proba_text = mnb.predict_proba(X_test_text)

# Evaluate
accuracy_text = accuracy_score(y_test_text, y_pred_text)
print(f"\nMultinomial Naive Bayes text classification accuracy: {accuracy_text:.4f}")

print("\nDetailed classification report:")
print(classification_report(y_test_text, y_pred_text, target_names=label_names))
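
Once the vectorizer and model are fitted, classifying a new sentence only requires transforming it with the same vectorizer. A minimal sketch (the sentence below is made up for illustration; because the default tokenizer does not segment Chinese, it may share few tokens with the training vocabulary, so the probabilities are mainly illustrative):

python
# Classify a new, unseen sentence (illustrative example text)
new_text = ["机器学习算法不断改进,应用广泛"]
new_vec = vectorizer.transform(new_text)    # reuse the fitted CountVectorizer
new_pred = mnb.predict(new_vec)[0]
new_proba = mnb.predict_proba(new_vec)[0]

print(f"Predicted class: {label_names[new_pred]}")
print(f"Class probabilities: {dict(zip(label_names, np.round(new_proba, 3)))}")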

9.4.2 Feature Importance Analysis

python
# Analyze most important feature words
feature_names = vectorizer.get_feature_names_out()

# Get log probability of features for each class
feature_log_prob = mnb.feature_log_prob_

print("Most important feature words for each class:")
print("=" * 50)

for i, class_name in enumerate(label_names):
    print(f"\nMost important words for {class_name} class:")

    # Get feature probabilities for this class
    class_prob = feature_log_prob[i]

    # Find words with highest probability
    top_indices = np.argsort(class_prob)[-10:]

    for j, idx in enumerate(reversed(top_indices)):
        word = feature_names[idx]
        prob = np.exp(class_prob[idx])
        print(f"  {j+1:2d}. {word}: {prob:.4f}")

# Visualize feature importance
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for i, class_name in enumerate(label_names):
    class_prob = np.exp(feature_log_prob[i])
    top_indices = np.argsort(class_prob)[-10:]
    top_words = [feature_names[idx] for idx in top_indices]
    top_probs = [class_prob[idx] for idx in top_indices]

    axes[i].barh(range(len(top_words)), top_probs)
    axes[i].set_yticks(range(len(top_words)))
    axes[i].set_yticklabels(top_words)
    axes[i].set_xlabel('Probability')
    axes[i].set_title(f'Important Feature Words for {class_name} Class')
    axes[i].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

9.4.3 Effect of Smoothing Parameter

python
# Analyze effect of Laplace smoothing parameter alpha
alpha_values = [0.1, 0.5, 1.0, 2.0, 5.0]
alpha_results = {}

print("Effect of Laplace smoothing parameter alpha:")
print("alpha\tAccuracy\t\tCross-validation Score")
print("-" * 40)

for alpha in alpha_values:
    mnb_alpha = MultinomialNB(alpha=alpha)
    mnb_alpha.fit(X_train_text, y_train_text)

    # Test set performance
    y_pred_alpha = mnb_alpha.predict(X_test_text)
    accuracy_alpha = accuracy_score(y_test_text, y_pred_alpha)

    # Cross-validation
    cv_scores = cross_val_score(mnb_alpha, X_text, labels, cv=5)
    cv_mean = np.mean(cv_scores)

    alpha_results[alpha] = {'accuracy': accuracy_alpha, 'cv_score': cv_mean}
    print(f"{alpha}\t{accuracy_alpha:.4f}\t\t{cv_mean:.4f}")

# Visualize effect of alpha parameter
plt.figure(figsize=(10, 6))
alphas = list(alpha_results.keys())
accuracies = [alpha_results[alpha]['accuracy'] for alpha in alphas]
cv_scores = [alpha_results[alpha]['cv_score'] for alpha in alphas]

plt.plot(alphas, accuracies, 'o-', label='Test Accuracy', linewidth=2, markersize=8)
plt.plot(alphas, cv_scores, 's-', label='Cross-validation Score', linewidth=2, markersize=8)

plt.xlabel('Alpha Parameter')
plt.ylabel('Performance Score')
plt.title('Effect of Laplace Smoothing Parameter on Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.xscale('log')
plt.show()

9.5 Bernoulli Naive Bayes

9.5.1 Binary Feature Processing

Bernoulli Naive Bayes is suitable for handling binary features, such as whether a document contains a certain word.
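
For binary features x₁, ..., xₙ (1 if the word appears in the document, 0 otherwise), each class y is scored as

P(x₁, ..., xₙ|y) = P(x₁=1|y)^x₁ × (1 − P(x₁=1|y))^(1−x₁) × ... × P(xₙ=1|y)^xₙ × (1 − P(xₙ=1|y))^(1−xₙ)

Unlike the multinomial model, absent words also contribute through the (1 − P) factors, so a class is penalized for words it usually contains but the document lacks.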

python
# Create binary feature data
# Convert text to binary features (whether word appears)
binary_vectorizer = CountVectorizer(binary=True, max_features=50)
X_binary = binary_vectorizer.fit_transform(texts)

print(f"Binary feature data shape: {X_binary.shape}")
print(f"Feature example (first 5 documents, first 10 features):")
print(X_binary[:5, :10].toarray())

# Split data
X_train_binary, X_test_binary, y_train_binary, y_test_binary = train_test_split(
    X_binary, labels, test_size=0.3, random_state=42, stratify=labels
)

# Train Bernoulli Naive Bayes
bnb = BernoulliNB(alpha=1.0)
bnb.fit(X_train_binary, y_train_binary)

# Predict
y_pred_binary = bnb.predict(X_test_binary)
accuracy_binary = accuracy_score(y_test_binary, y_pred_binary)

print(f"\nBernoulli Naive Bayes accuracy: {accuracy_binary:.4f}")

# Compare Multinomial and Bernoulli Naive Bayes
print("\nMultinomial vs Bernoulli Naive Bayes comparison:")
print("Model\t\t\tAccuracy")
print("-" * 30)

# Multinomial Naive Bayes (using binary data)
mnb_binary = MultinomialNB(alpha=1.0)
mnb_binary.fit(X_train_binary, y_train_binary)
y_pred_mnb_binary = mnb_binary.predict(X_test_binary)
accuracy_mnb_binary = accuracy_score(y_test_binary, y_pred_mnb_binary)

print(f"Multinomial Naive Bayes\t{accuracy_mnb_binary:.4f}")
print(f"Bernoulli Naive Bayes\t{accuracy_binary:.4f}")

9.5.2 Effect of Feature Selection

python
# Analyze effect of number of features on performance
feature_numbers = [10, 20, 50, 100, 200]
performance_comparison = {}

for n_features in feature_numbers:
    # Create vectorizers with different feature counts
    vec_count = CountVectorizer(max_features=n_features)
    vec_binary = CountVectorizer(binary=True, max_features=n_features)

    # Vectorize
    X_count = vec_count.fit_transform(texts)
    X_bin = vec_binary.fit_transform(texts)

    # Split data
    X_train_c, X_test_c, y_train_c, y_test_c = train_test_split(
        X_count, labels, test_size=0.3, random_state=42, stratify=labels
    )
    X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
        X_bin, labels, test_size=0.3, random_state=42, stratify=labels
    )

    # Train models
    mnb_temp = MultinomialNB(alpha=1.0)
    bnb_temp = BernoulliNB(alpha=1.0)

    mnb_temp.fit(X_train_c, y_train_c)
    bnb_temp.fit(X_train_b, y_train_b)

    # Predict
    y_pred_mnb_temp = mnb_temp.predict(X_test_c)
    y_pred_bnb_temp = bnb_temp.predict(X_test_b)

    # Calculate accuracy
    acc_mnb = accuracy_score(y_test_c, y_pred_mnb_temp)
    acc_bnb = accuracy_score(y_test_b, y_pred_bnb_temp)

    performance_comparison[n_features] = {
        'multinomial': acc_mnb,
        'bernoulli': acc_bnb
    }

# Visualize effect of feature count
plt.figure(figsize=(10, 6))
features = list(performance_comparison.keys())
mnb_accs = [performance_comparison[f]['multinomial'] for f in features]
bnb_accs = [performance_comparison[f]['bernoulli'] for f in features]

plt.plot(features, mnb_accs, 'o-', label='Multinomial Naive Bayes', linewidth=2, markersize=8)
plt.plot(features, bnb_accs, 's-', label='Bernoulli Naive Bayes', linewidth=2, markersize=8)

plt.xlabel('Number of Features')
plt.ylabel('Accuracy')
plt.title('Effect of Feature Count on Naive Bayes Performance')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("Effect of feature count on performance:")
print("Features\tMultinomial NB\tBernoulli NB")
print("-" * 35)
for f in features:
    print(f"{f}\t{performance_comparison[f]['multinomial']:.4f}\t\t{performance_comparison[f]['bernoulli']:.4f}")

9.6 Complement Naive Bayes

9.6.1 Handling Imbalanced Data

Complement Naive Bayes estimates each class's feature statistics from the complement of that class (all training documents belonging to the other classes), which makes it particularly well suited to imbalanced text classification problems.

python
# Create an imbalanced text dataset: keep all 8 Technology texts but only 4 Sports and 4 Food texts
imbalanced_texts = texts[:8] + texts[8:12] + texts[16:20]  # Technology 8, Sports 4, Food 4
imbalanced_labels = [0]*8 + [1]*4 + [2]*4

print("Imbalanced dataset:")
print(f"Class distribution: {np.bincount(imbalanced_labels)}")
print(f"Class ratio: {np.bincount(imbalanced_labels) / len(imbalanced_labels)}")

# Vectorize
imbalanced_vectorizer = CountVectorizer(max_features=50)
X_imbalanced = imbalanced_vectorizer.fit_transform(imbalanced_texts)

# Split data
X_train_imb, X_test_imb, y_train_imb, y_test_imb = train_test_split(
    X_imbalanced, imbalanced_labels, test_size=0.3, random_state=42, stratify=imbalanced_labels
)

# Compare performance of different Naive Bayes algorithms on imbalanced data
nb_algorithms = {
    'Multinomial Naive Bayes': MultinomialNB(alpha=1.0),
    'Complement Naive Bayes': ComplementNB(alpha=1.0),
    'Bernoulli Naive Bayes': BernoulliNB(alpha=1.0)
}

print("\nPerformance comparison on imbalanced dataset:")
print("Algorithm\t\t\tAccuracy\t\tMacro F1")
print("-" * 50)

from sklearn.metrics import f1_score

for name, algorithm in nb_algorithms.items():
    if name == 'Bernoulli Naive Bayes':
        # Create binary features for Bernoulli Naive Bayes
        X_train_temp = (X_train_imb > 0).astype(int)
        X_test_temp = (X_test_imb > 0).astype(int)
    else:
        X_train_temp = X_train_imb
        X_test_temp = X_test_imb

    algorithm.fit(X_train_temp, y_train_imb)
    y_pred_imb = algorithm.predict(X_test_temp)

    accuracy_imb = accuracy_score(y_test_imb, y_pred_imb)
    f1_macro = f1_score(y_test_imb, y_pred_imb, average='macro')

    print(f"{name}\t{accuracy_imb:.4f}\t\t{f1_macro:.4f}")

# Detailed analysis of Complement Naive Bayes performance
cnb = ComplementNB(alpha=1.0)
cnb.fit(X_train_imb, y_train_imb)
y_pred_cnb = cnb.predict(X_test_imb)

print(f"\nComplement Naive Bayes detailed classification report:")
print(classification_report(y_test_imb, y_pred_cnb, target_names=label_names))

9.7 Practical Application Cases

9.7.1 Spam Filtering

python
# Create spam classification dataset
spam_emails = [
    "恭喜您中奖了!立即点击领取大奖!",
    "免费获得iPhone,仅限今天!",
    "投资理财,月收益30%,无风险!",
    "减肥药效果神奇,一周瘦10斤!",
    "贷款无需抵押,当天放款!",
    "点击链接获得免费礼品!",
    "特价商品,限时抢购!",
    "网络兼职,日赚500元!"
]

normal_emails = [
    "明天的会议改到下午3点,请准时参加。",
    "您的订单已发货,预计3天内到达。",
    "感谢您参加我们的产品发布会。",
    "请查收本月的工作报告。",
    "周末聚餐的地点定在市中心餐厅。",
    "项目进度更新,请查看附件。",
    "生日快乐!祝您身体健康!",
    "课程安排有调整,请注意查看。"
]

# Merge data
all_emails = spam_emails + normal_emails
email_labels = [1]*len(spam_emails) + [0]*len(normal_emails)  # 1-Spam, 0-Normal

print("Email classification dataset:")
print(f"Total emails: {len(all_emails)}")
print(f"Spam emails: {sum(email_labels)}")
print(f"Normal emails: {len(email_labels) - sum(email_labels)}")

# Text vectorization
email_vectorizer = TfidfVectorizer(max_features=100, stop_words=None)
X_emails = email_vectorizer.fit_transform(all_emails)

# Split data
X_train_email, X_test_email, y_train_email, y_test_email = train_test_split(
    X_emails, email_labels, test_size=0.3, random_state=42, stratify=email_labels
)

# Train Multinomial Naive Bayes
spam_classifier = MultinomialNB(alpha=1.0)
spam_classifier.fit(X_train_email, y_train_email)

# Predict
y_pred_email = spam_classifier.predict(X_test_email)
y_pred_proba_email = spam_classifier.predict_proba(X_test_email)

# Evaluate
accuracy_email = accuracy_score(y_test_email, y_pred_email)
print(f"\nSpam classification accuracy: {accuracy_email:.4f}")

print("\nDetailed classification report:")
print(classification_report(y_test_email, y_pred_email,
                          target_names=['Normal Email', 'Spam Email']))

# Analyze important features
feature_names_email = email_vectorizer.get_feature_names_out()
feature_log_prob_email = spam_classifier.feature_log_prob_

print("\nImportant feature words for spam emails:")
spam_prob = np.exp(feature_log_prob_email[1])  # Spam email class
top_spam_indices = np.argsort(spam_prob)[-10:]

for i, idx in enumerate(reversed(top_spam_indices)):
    word = feature_names_email[idx]
    prob = spam_prob[idx]
    print(f"  {i+1:2d}. {word}: {prob:.4f}")

print("\nImportant feature words for normal emails:")
normal_prob = np.exp(feature_log_prob_email[0])  # Normal email class
top_normal_indices = np.argsort(normal_prob)[-10:]

for i, idx in enumerate(reversed(top_normal_indices)):
    word = feature_names_email[idx]
    prob = normal_prob[idx]
    print(f"  {i+1:2d}. {word}: {prob:.4f}")

9.7.2 Sentiment Analysis

python
# Create sentiment analysis dataset
positive_reviews = [
    "这部电影太精彩了,演员表演出色!",
    "服务态度很好,菜品味道不错。",
    "产品质量优秀,物超所值。",
    "课程内容丰富,老师讲解清晰。",
    "环境优美,设施完善。",
    "工作人员热情友好,体验很棒。"
]

negative_reviews = [
    "电影剧情拖沓,浪费时间。",
    "服务差劲,态度恶劣。",
    "产品质量有问题,不推荐购买。",
    "课程内容过时,讲解不清楚。",
    "环境嘈杂,设施陈旧。",
    "工作人员不专业,体验糟糕。"
]

# Merge data
all_reviews = positive_reviews + negative_reviews
sentiment_labels = [1]*len(positive_reviews) + [0]*len(negative_reviews)  # 1-Positive, 0-Negative

print("Sentiment analysis dataset:")
print(f"Total reviews: {len(all_reviews)}")
print(f"Positive reviews: {sum(sentiment_labels)}")
print(f"Negative reviews: {len(sentiment_labels) - sum(sentiment_labels)}")

# Text vectorization
sentiment_vectorizer = TfidfVectorizer(max_features=50)
X_sentiment = sentiment_vectorizer.fit_transform(all_reviews)

# Train Naive Bayes classifier
sentiment_classifier = MultinomialNB(alpha=1.0)
sentiment_classifier.fit(X_sentiment, sentiment_labels)

# Test new reviews
test_reviews = [
    "这个产品真的很棒,强烈推荐!",
    "质量太差了,完全不值这个价格。",
    "服务还可以,但是有改进空间。"
]

test_vectors = sentiment_vectorizer.transform(test_reviews)
test_predictions = sentiment_classifier.predict(test_vectors)
test_probabilities = sentiment_classifier.predict_proba(test_vectors)

print("\nSentiment analysis results for new reviews:")
for i, review in enumerate(test_reviews):
    sentiment = "Positive" if test_predictions[i] == 1 else "Negative"
    confidence = np.max(test_probabilities[i])
    print(f"Review: {review}")
    print(f"Sentiment: {sentiment} (Confidence: {confidence:.3f})")
    print()

9.8 Naive Bayes Optimization Techniques

9.8.1 Feature Engineering

python
# Demonstrate effect of different feature engineering techniques on Naive Bayes
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

# Use original text data
original_texts = texts
original_labels = labels

# 1. Different vectorization methods
vectorizers = {
    'CountVectorizer': CountVectorizer(max_features=100),
    'TfidfVectorizer': TfidfVectorizer(max_features=100),
    'Binary CountVectorizer': CountVectorizer(binary=True, max_features=100)
}

print("Effect of different vectorization methods:")
print("Method\t\t\tAccuracy")
print("-" * 35)

for name, vectorizer in vectorizers.items():
    X_vec = vectorizer.fit_transform(original_texts)
    X_train_vec, X_test_vec, y_train_vec, y_test_vec = train_test_split(
        X_vec, original_labels, test_size=0.3, random_state=42, stratify=original_labels
    )

    if 'Binary' in name:
        nb_vec = BernoulliNB(alpha=1.0)
    else:
        nb_vec = MultinomialNB(alpha=1.0)

    nb_vec.fit(X_train_vec, y_train_vec)
    y_pred_vec = nb_vec.predict(X_test_vec)
    accuracy_vec = accuracy_score(y_test_vec, y_pred_vec)

    print(f"{name}\t{accuracy_vec:.4f}")

# 2. Effect of feature selection
print(f"\nEffect of feature selection:")
print("Features\t\tAccuracy")
print("-" * 25)

# Use TF-IDF vectorization
tfidf_vec = TfidfVectorizer(max_features=200)
X_tfidf = tfidf_vec.fit_transform(original_texts)

k_values = [10, 20, 50, 100, 150]
for k in k_values:
    # Use chi-square test for feature selection
    # Clamp k to the actual vocabulary size (the unsegmented Chinese corpus may yield fewer tokens than requested)
    k_eff = min(k, X_tfidf.shape[1])
    selector = SelectKBest(chi2, k=k_eff)
    X_selected = selector.fit_transform(X_tfidf, original_labels)

    X_train_sel, X_test_sel, y_train_sel, y_test_sel = train_test_split(
        X_selected, original_labels, test_size=0.3, random_state=42, stratify=original_labels
    )

    nb_sel = MultinomialNB(alpha=1.0)
    nb_sel.fit(X_train_sel, y_train_sel)
    y_pred_sel = nb_sel.predict(X_test_sel)
    accuracy_sel = accuracy_score(y_test_sel, y_pred_sel)

    print(f"{k}\t\t{accuracy_sel:.4f}")

9.8.2 Ensemble Naive Bayes

python
# Create Naive Bayes ensemble model
from sklearn.ensemble import VotingClassifier

# Prepare different Naive Bayes models
# 1. Multinomial Naive Bayes based on word frequency
count_vec = CountVectorizer(max_features=100)
X_count = count_vec.fit_transform(original_texts)

# 2. Multinomial Naive Bayes based on TF-IDF
tfidf_vec = TfidfVectorizer(max_features=100)
X_tfidf = tfidf_vec.fit_transform(original_texts)

# 3. Bernoulli Naive Bayes based on binary features
binary_vec = CountVectorizer(binary=True, max_features=100)
X_binary = binary_vec.fit_transform(original_texts)

# Split data
X_train_count, X_test_count, y_train, y_test = train_test_split(
    X_count, original_labels, test_size=0.3, random_state=42, stratify=original_labels
)
X_train_tfidf, X_test_tfidf, _, _ = train_test_split(
    X_tfidf, original_labels, test_size=0.3, random_state=42, stratify=original_labels
)
X_train_binary, X_test_binary, _, _ = train_test_split(
    X_binary, original_labels, test_size=0.3, random_state=42, stratify=original_labels
)

# Train individual models
nb_count = MultinomialNB(alpha=1.0)
nb_tfidf = MultinomialNB(alpha=1.0)
nb_binary = BernoulliNB(alpha=1.0)

nb_count.fit(X_train_count, y_train)
nb_tfidf.fit(X_train_tfidf, y_train)
nb_binary.fit(X_train_binary, y_train)

# Predict
y_pred_count = nb_count.predict(X_test_count)
y_pred_tfidf = nb_tfidf.predict(X_test_tfidf)
y_pred_binary = nb_binary.predict(X_test_binary)

# Simple voting ensemble
ensemble_pred = []
for i in range(len(y_test)):
    votes = [y_pred_count[i], y_pred_tfidf[i], y_pred_binary[i]]
    ensemble_pred.append(max(set(votes), key=votes.count))

# Evaluate results
print("Naive Bayes ensemble results:")
print("Model\t\t\tAccuracy")
print("-" * 35)
print(f"Word Frequency MNB\t\t{accuracy_score(y_test, y_pred_count):.4f}")
print(f"TF-IDF MNB\t\t{accuracy_score(y_test, y_pred_tfidf):.4f}")
print(f"Binary BNB\t\t{accuracy_score(y_test, y_pred_binary):.4f}")
print(f"Voting Ensemble\t\t{accuracy_score(y_test, ensemble_pred):.4f}")

# Visualize ensemble effect
models = ['Word Freq NB', 'TF-IDF NB', 'Binary NB', 'Voting Ensemble']
accuracies = [
    accuracy_score(y_test, y_pred_count),
    accuracy_score(y_test, y_pred_tfidf),
    accuracy_score(y_test, y_pred_binary),
    accuracy_score(y_test, ensemble_pred)
]

plt.figure(figsize=(10, 6))
bars = plt.bar(models, accuracies, color=['skyblue', 'lightgreen', 'lightcoral', 'gold'], alpha=0.7)
plt.title('Naive Bayes Ensemble Effect Comparison')
plt.ylabel('Accuracy')
plt.ylim(0.6, 1.0)

# Add value labels
for bar, acc in zip(bars, accuracies):
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width()/2., height + 0.01,
             f'{acc:.3f}', ha='center', va='bottom')

plt.tight_layout()
plt.show()
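
The same idea can also be expressed with scikit-learn's VotingClassifier (imported above). A minimal sketch that combines three Naive Bayes variants on the same word-count features, rather than the same model on different feature representations:

python
# Soft-voting ensemble of Naive Bayes variants on the word-count features
voting_nb = VotingClassifier(
    estimators=[
        ('multinomial', MultinomialNB(alpha=1.0)),
        ('complement', ComplementNB(alpha=1.0)),
        ('bernoulli', BernoulliNB(alpha=1.0)),   # binarizes the counts internally (binarize=0.0)
    ],
    voting='soft'  # average predicted probabilities instead of counting hard votes
)
voting_nb.fit(X_train_count, y_train)
print(f"VotingClassifier accuracy: {voting_nb.score(X_test_count, y_test):.4f}")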

9.9 Exercises

Exercise 1: Basic Naive Bayes

  1. Train a Gaussian Naive Bayes classifier using the wine dataset
  2. Analyze the parameters learned by the model (means and variances)
  3. Compare performance before and after standardization

Exercise 2: Text Classification

  1. Collect or create a multi-class text dataset
  2. Compare performance of Multinomial and Bernoulli Naive Bayes
  3. Analyze the effect of different smoothing parameters on performance

Exercise 3: Feature Engineering

  1. Use a news dataset for text classification
  2. Compare effects of different vectorization methods (Count, TF-IDF, Binary)
  3. Use feature selection techniques to improve model performance

Exercise 4: Handling Imbalanced Data

  1. Create a severely imbalanced classification dataset
  2. Compare performance of different Naive Bayes algorithms
  3. Try using sampling techniques to improve performance

9.10 Summary

In this chapter, we explored the Naive Bayes algorithm in depth:

Core Concepts

  • Bayes' Theorem: Mathematical foundation of probabilistic reasoning
  • Naive Assumption: Feature independence assumption and its impact
  • Different Variants: Gaussian, Multinomial, Bernoulli, and Complement Naive Bayes

Main Techniques

  • Gaussian Naive Bayes: Handles continuous features, assumes normal distribution
  • Multinomial Naive Bayes: Handles discrete features, suitable for text classification
  • Bernoulli Naive Bayes: Handles binary features
  • Complement Naive Bayes: Handles imbalanced data

Practical Skills

  • Text Classification: Spam filtering, sentiment analysis
  • Feature Engineering: Vectorization, feature selection
  • Parameter Tuning: Selection and impact of smoothing parameters
  • Ensemble Methods: Combining different Naive Bayes models

Key Points

  • Naive Bayes is simple and efficient, suitable for rapid prototyping
  • Performs excellently on high-dimensional sparse data like text classification
  • Requires appropriate smoothing techniques to handle zero probabilities
  • Although the independence assumption is "naive", it often works effectively in practice

9.11 Next Steps

You have now mastered Naive Bayes, an important probabilistic classification algorithm! In the next chapter, K-Nearest Neighbors, we will study a completely different approach, instance-based learning, and the machine learning philosophy that "birds of a feather flock together."


Chapter Key Points Review:

  • ✅ Understood Bayes' theorem and the naive assumption
  • ✅ Mastered application scenarios of different types of Naive Bayes
  • ✅ Learned the complete text classification workflow
  • ✅ Understood the importance of feature engineering for Naive Bayes
  • ✅ Mastered techniques for handling imbalanced data
  • ✅ Able to build practical text classification systems
