
Chapter 2: Quick Start Guide

Welcome to the Scikit-learn Quick Start Guide! This chapter walks you through a complete machine learning example, covering the entire process from data loading to model prediction.

2.1 Basic Machine Learning Concepts

Before we start coding, let's review a few core concepts:

  • Features: Input variables used for prediction, usually denoted as X
  • Labels: Target variables we want to predict, usually denoted as y
  • Training Set: Data used for training the model
  • Test Set: Data used for evaluating model performance
  • Model: Algorithm that learns data patterns
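These concepts map directly onto a few lines of code. As a minimal sketch (the array values below are made up for illustration; the variable names X and y are just the usual conventions):

```python
import numpy as np

# A tiny toy dataset: 4 samples, 2 features each
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])  # feature matrix: one row per sample
y = np.array([0, 0, 1, 1])  # label vector: one target per sample

# A simple split: first 3 samples for training, the last one for testing
X_train, y_train = X[:3], y[:3]
X_test, y_test = X[3:], y[3:]

print(X_train.shape, X_test.shape)  # (3, 2) (1, 2)
```

In practice you would not split by slicing like this; Scikit-learn's train_test_split (used in Step 5 below) shuffles the data and keeps class proportions balanced.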

2.2 First Machine Learning Project: Iris Classification

We will use the famous Iris dataset to build a classification model. The dataset contains 150 iris samples, each described by 4 features, and the task is to predict which of 3 species each sample belongs to.

Step 1: Import Necessary Libraries

python
# Import basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import Scikit-learn components
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Set a font that supports Chinese characters (optional; only needed if your plot labels contain Chinese text, and SimHei must be installed)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

Step 2: Load and Explore Data

python
# Load Iris dataset
iris = load_iris()

# View dataset information
print("Dataset Description:")
print(iris.DESCR[:500] + "...")
print("\n" + "="*50)

# Get features and labels
X = iris.data  # Feature matrix
y = iris.target  # Label vector

# View data shapes
print(f"Feature Matrix Shape: {X.shape}")
print(f"Label Vector Shape: {y.shape}")

# View feature names
print(f"Feature Names: {iris.feature_names}")
print(f"Class Names: {iris.target_names}")

Step 3: Data Exploration and Visualization

python
# Create DataFrame for easier analysis
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = iris.target_names[y]

# View first few rows of data
print("First 5 Rows of Data:")
print(df.head())

# View data statistics
print("\nData Statistics:")
print(df.describe())

# View class distribution
print("\nClass Distribution:")
print(df['species'].value_counts())

Step 4: Data Visualization

python
# Create figure
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Iris Dataset Feature Distribution', fontsize=16)

# Plot distribution of each feature
features = iris.feature_names
for i, feature in enumerate(features):
    row = i // 2
    col = i % 2
    
    # Plot histogram by category
    for species in iris.target_names:
        data = df[df['species'] == species][feature]
        axes[row, col].hist(data, alpha=0.7, label=species, bins=15)
    
    axes[row, col].set_title(feature)
    axes[row, col].set_xlabel('Value')
    axes[row, col].set_ylabel('Frequency')
    axes[row, col].legend()

plt.tight_layout()
plt.show()

# Plot pairwise relationships between features
# Note: sns.pairplot creates its own figure, so plt.figure() is not needed here
grid = sns.pairplot(df, hue='species', markers=["o", "s", "D"])
grid.fig.suptitle('Feature Relationships', y=1.02)
plt.show()

Step 5: Data Splitting

python
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% as test set
    random_state=42,      # Set random seed for reproducibility
    stratify=y           # Maintain class proportions
)

print(f"Training Set Size: {X_train.shape[0]}")
print(f"Test Set Size: {X_test.shape[0]}")

# View class distribution after splitting
print("\nTraining Set Class Distribution:")
unique, counts = np.unique(y_train, return_counts=True)
for i, count in enumerate(counts):
    print(f"{iris.target_names[unique[i]]}: {count}")

print("\nTest Set Class Distribution:")
unique, counts = np.unique(y_test, return_counts=True)
for i, count in enumerate(counts):
    print(f"{iris.target_names[unique[i]]}: {count}")

Step 6: Train Model

python
# Create logistic regression model
# (max_iter is raised from the default 100 so the lbfgs solver converges on this data)
model = LogisticRegression(max_iter=200, random_state=42)

# Train model
print("Training model...")
model.fit(X_train, y_train)
print("Model training complete!")

# View model parameters
print(f"\nModel Coefficient Shape: {model.coef_.shape}")
print(f"Model Intercept: {model.intercept_}")

Step 7: Model Prediction

python
# Make predictions on test set
y_pred = model.predict(X_test)

# Prediction probabilities
y_pred_proba = model.predict_proba(X_test)

print("Prediction Results (First 10):")
for i in range(min(10, len(y_test))):
    true_label = iris.target_names[y_test[i]]
    pred_label = iris.target_names[y_pred[i]]
    confidence = np.max(y_pred_proba[i])
    print(f"True: {true_label:12} | Predicted: {pred_label:12} | Confidence: {confidence:.3f}")

Step 8: Model Evaluation

python
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.3f} ({accuracy*100:.1f}%)")

# Detailed classification report
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names,
            yticklabels=iris.target_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

Step 9: Model Application

python
# Create new samples for prediction
new_samples = np.array([
    [5.1, 3.5, 1.4, 0.2],  # Features similar to setosa
    [6.2, 2.8, 4.8, 1.8],  # Features similar to versicolor
    [7.2, 3.0, 5.8, 1.6]   # Features similar to virginica
])

# Make predictions
predictions = model.predict(new_samples)
probabilities = model.predict_proba(new_samples)

print("New Sample Prediction Results:")
for i, (sample, pred, prob) in enumerate(zip(new_samples, predictions, probabilities)):
    print(f"\nSample {i+1}: {sample}")
    print(f"Predicted Class: {iris.target_names[pred]}")
    print(f"Class Probabilities:")
    for j, class_name in enumerate(iris.target_names):
        print(f"  {class_name}: {prob[j]:.3f}")

2.3 Complete Code Example

Integrate all the above steps into a complete script:

python
# iris_classification.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def main():
    # 1. Load data
    iris = load_iris()
    X, y = iris.data, iris.target
    
    # 2. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    
    # 3. Train model
    model = LogisticRegression(max_iter=200, random_state=42)
    model.fit(X_train, y_train)
    
    # 4. Predict and evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    print(f"Model Accuracy: {accuracy:.3f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=iris.target_names))
    
    return model, iris

if __name__ == "__main__":
    model, iris = main()

2.4 Core Advantages of Scikit-learn

Through this simple example, we can see several core advantages of Scikit-learn:

1. Unified API Design

All models follow the same interface:

  • fit(X, y): Train model
  • predict(X): Make predictions
  • score(X, y): Evaluate performance
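The walkthrough above called fit() and predict() explicitly; score() combines prediction and evaluation in one call (for classifiers it returns mean accuracy). A short sketch on the same Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

model = LogisticRegression(max_iter=200, random_state=42)
model.fit(X_train, y_train)             # train
print(model.score(X_test, y_test))      # mean accuracy on the test set
```

Because every estimator exposes this same trio of methods, swapping one algorithm for another usually changes only the constructor line.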

2. Rich Algorithm Library

python
# Can easily switch between different algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Support Vector Machine
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)

# K-Nearest Neighbors
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)

3. Complete Tool Chain

  • Data preprocessing
  • Model selection
  • Performance evaluation
  • Cross-validation
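These pieces compose cleanly. As one illustration (not the only way to combine them), a Pipeline chains preprocessing and a classifier so that cross-validation refits both on every fold:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()

# Chain feature scaling and classification into one estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# 5-fold cross-validation: each fold scales on its own training split
scores = cross_val_score(pipe, iris.data, iris.target, cv=5)
print(f"CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Cross-validation is covered in more depth later; the point here is that the pipeline itself behaves like any other estimator with fit() and predict().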

2.5 Common Machine Learning Workflow

python
# Standard machine learning workflow
def ml_workflow(X, y, model_class, **model_params):
    """
    Standard machine learning workflow
    """
    # 1. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # 2. Create and train model
    model = model_class(**model_params)
    model.fit(X_train, y_train)
    
    # 3. Predict and evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    return model, accuracy

# Usage example
from sklearn.ensemble import RandomForestClassifier

model, acc = ml_workflow(X, y, RandomForestClassifier, n_estimators=100)
print(f"Random Forest Accuracy: {acc:.3f}")

2.6 Exercises

Exercise 1: Basic Operations

  1. Use different test_size values (0.1, 0.3, 0.4) and observe the impact on model performance
  2. Try not setting random_state and run the code multiple times to observe result changes
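As a starting point for part 1 of this exercise, the loop below is only a scaffold (interpreting the differences is the actual exercise):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# Compare several train/test split ratios
for size in (0.1, 0.3, 0.4):
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=size, random_state=42, stratify=iris.target
    )
    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    print(f"test_size={size}: accuracy={model.score(X_test, y_test):.3f}")
```

For part 2, simply drop the random_state argument from train_test_split and rerun the loop a few times.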

Exercise 2: Algorithm Comparison

Use the following algorithms to classify the Iris dataset and compare their performance:

  • Decision Tree (DecisionTreeClassifier)
  • Random Forest (RandomForestClassifier)
  • Support Vector Machine (SVC)

Exercise 3: Data Exploration

  1. Calculate the mean and standard deviation of each feature
  2. Find which two features have the strongest correlation
  3. Plot the distribution of each class in a two-dimensional feature space
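For parts 1 and 2 of this exercise, pandas does most of the work; a possible starting point:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Part 1: mean and standard deviation of each feature
print(df.agg(['mean', 'std']))

# Part 2: pairwise correlations; the largest off-diagonal entry
# identifies the most strongly correlated feature pair
corr = df.corr()
print(corr.round(3))
```

For part 3, a scatter plot of the two most correlated features colored by species (e.g. with plt.scatter or sns.scatterplot) works well.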

Exercise 4: Predict New Samples

Create 5 new Iris samples, use the trained model to make predictions, and analyze prediction confidence.

2.7 Summary

In this chapter, we learned:

  1. Basic Machine Learning Concepts: Features, labels, training set, test set
  2. Complete ML Workflow: Data loading → exploration → splitting → training → prediction → evaluation
  3. Scikit-learn Core API: fit(), predict(), score()
  4. Model Evaluation Methods: Accuracy, classification report, confusion matrix
  5. Data Visualization Techniques: Histograms, scatter plots, heatmaps

Key Points

  • Scikit-learn provides a unified, concise API
  • Machine learning projects follow a standard workflow
  • Data exploration and visualization are important first steps
  • Model evaluation is more than just looking at accuracy

2.8 Next Steps

Now you have experienced the complete machine learning workflow! In the next chapter, Data Preprocessing Basics, we will take a closer look at how to handle "dirty" real-world data, a key step in building successful machine learning models.


Chapter Key Points Review:

  • ✅ Mastered basic usage of Scikit-learn
  • ✅ Understood standard machine learning workflow
  • ✅ Learned basic model evaluation methods
  • ✅ Experienced the complete process from data to prediction

Content is for learning and research only.