
Chapter 2: Quick Start Guide

Welcome to the Scikit-learn Quick Start Guide! This chapter walks you through a complete machine learning example, covering the entire process from data loading to model prediction.

2.1 Basic Machine Learning Concepts

Before we start coding, let's review a few core concepts:

  • Features: Input variables used for prediction, usually denoted as X
  • Labels: Target variables we want to predict, usually denoted as y
  • Training Set: Data used for training the model
  • Test Set: Data used for evaluating model performance
  • Model: Algorithm that learns data patterns
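These concepts map directly onto a few lines of code. As a minimal sketch (the array values below are made up for illustration; the variable names X and y are just the usual conventions):

```python
import numpy as np

# A tiny toy dataset: 4 samples, 2 features each
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])  # feature matrix: one row per sample
y = np.array([0, 0, 1, 1])  # label vector: one target per sample

# A simple split: first 3 samples for training, the last one for testing
X_train, y_train = X[:3], y[:3]
X_test, y_test = X[3:], y[3:]

print(X_train.shape, X_test.shape)  # (3, 2) (1, 2)
```

In practice you would not split by slicing like this; Scikit-learn's train_test_split (used in Step 5 below) shuffles the data and keeps class proportions balanced.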

2.2 First Machine Learning Project: Iris Classification

We will use the famous Iris dataset to build a classification model. The dataset contains 150 iris samples, each described by 4 features, and the task is to predict which of 3 species each sample belongs to.

Step 1: Import Necessary Libraries

python
# Import basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import Scikit-learn components
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Set a font that supports Chinese characters (optional; only needed if your plot labels contain Chinese text, and SimHei must be installed)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

Step 2: Load and Explore Data

python
# Load Iris dataset
iris = load_iris()

# View dataset information
print("Dataset Description:")
print(iris.DESCR[:500] + "...")
print("\n" + "="*50)

# Get features and labels
X = iris.data  # Feature matrix
y = iris.target  # Label vector

# View data shapes
print(f"Feature Matrix Shape: {X.shape}")
print(f"Label Vector Shape: {y.shape}")

# View feature names
print(f"Feature Names: {iris.feature_names}")
print(f"Class Names: {iris.target_names}")

Step 3: Data Exploration and Visualization

python
# Create DataFrame for easier analysis
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = iris.target_names[y]

# View first few rows of data
print("First 5 Rows of Data:")
print(df.head())

# View data statistics
print("\nData Statistics:")
print(df.describe())

# View class distribution
print("\nClass Distribution:")
print(df['species'].value_counts())

Step 4: Data Visualization

python
# Create figure
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Iris Dataset Feature Distribution', fontsize=16)

# Plot distribution of each feature
features = iris.feature_names
for i, feature in enumerate(features):
    row = i // 2
    col = i % 2
    
    # Plot histogram by category
    for species in iris.target_names:
        data = df[df['species'] == species][feature]
        axes[row, col].hist(data, alpha=0.7, label=species, bins=15)
    
    axes[row, col].set_title(feature)
    axes[row, col].set_xlabel('Value')
    axes[row, col].set_ylabel('Frequency')
    axes[row, col].legend()

plt.tight_layout()
plt.show()

# Plot pairwise relationships between features
# Note: sns.pairplot creates its own figure, so plt.figure() is not needed here
grid = sns.pairplot(df, hue='species', markers=["o", "s", "D"])
grid.fig.suptitle('Feature Relationships', y=1.02)
plt.show()

Step 5: Data Splitting

python
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2,      # 20% as test set
    random_state=42,      # Set random seed for reproducibility
    stratify=y           # Maintain class proportions
)

print(f"Training Set Size: {X_train.shape[0]}")
print(f"Test Set Size: {X_test.shape[0]}")

# View class distribution after splitting
print("\nTraining Set Class Distribution:")
unique, counts = np.unique(y_train, return_counts=True)
for i, count in enumerate(counts):
    print(f"{iris.target_names[unique[i]]}: {count}")

print("\nTest Set Class Distribution:")
unique, counts = np.unique(y_test, return_counts=True)
for i, count in enumerate(counts):
    print(f"{iris.target_names[unique[i]]}: {count}")

Step 6: Train Model

python
# Create logistic regression model
# (max_iter is raised from the default 100 so the lbfgs solver converges on this data)
model = LogisticRegression(max_iter=200, random_state=42)

# Train model
print("Training model...")
model.fit(X_train, y_train)
print("Model training complete!")

# View model parameters
print(f"\nModel Coefficient Shape: {model.coef_.shape}")
print(f"Model Intercept: {model.intercept_}")

Step 7: Model Prediction

python
# Make predictions on test set
y_pred = model.predict(X_test)

# Prediction probabilities
y_pred_proba = model.predict_proba(X_test)

print("Prediction Results (First 10):")
for i in range(min(10, len(y_test))):
    true_label = iris.target_names[y_test[i]]
    pred_label = iris.target_names[y_pred[i]]
    confidence = np.max(y_pred_proba[i])
    print(f"True: {true_label:12} | Predicted: {pred_label:12} | Confidence: {confidence:.3f}")

Step 8: Model Evaluation

python
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.3f} ({accuracy*100:.1f}%)")

# Detailed classification report
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

# Visualize confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names,
            yticklabels=iris.target_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()

Step 9: Model Application

python
# Create new samples for prediction
new_samples = np.array([
    [5.1, 3.5, 1.4, 0.2],  # Features similar to setosa
    [6.2, 2.8, 4.8, 1.8],  # Features similar to versicolor
    [7.2, 3.0, 5.8, 1.6]   # Features similar to virginica
])

# Make predictions
predictions = model.predict(new_samples)
probabilities = model.predict_proba(new_samples)

print("New Sample Prediction Results:")
for i, (sample, pred, prob) in enumerate(zip(new_samples, predictions, probabilities)):
    print(f"\nSample {i+1}: {sample}")
    print(f"Predicted Class: {iris.target_names[pred]}")
    print(f"Class Probabilities:")
    for j, class_name in enumerate(iris.target_names):
        print(f"  {class_name}: {prob[j]:.3f}")

2.3 Complete Code Example

Integrate all the above steps into a complete script:

python
# iris_classification.py
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

def main():
    # 1. Load data
    iris = load_iris()
    X, y = iris.data, iris.target
    
    # 2. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    
    # 3. Train model
    model = LogisticRegression(max_iter=200, random_state=42)
    model.fit(X_train, y_train)
    
    # 4. Predict and evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    print(f"Model Accuracy: {accuracy:.3f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=iris.target_names))
    
    return model, iris

if __name__ == "__main__":
    model, iris = main()

2.4 Core Advantages of Scikit-learn

Through this simple example, we can see several core advantages of Scikit-learn:

1. Unified API Design

All models follow the same interface:

  • fit(X, y): Train model
  • predict(X): Make predictions
  • score(X, y): Evaluate performance
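The walkthrough above called fit() and predict() explicitly; score() combines prediction and evaluation in one call (for classifiers it returns mean accuracy). A short sketch on the same Iris data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42, stratify=iris.target
)

model = LogisticRegression(max_iter=200, random_state=42)
model.fit(X_train, y_train)             # train
print(model.score(X_test, y_test))      # mean accuracy on the test set
```

Because every estimator exposes this same trio of methods, swapping one algorithm for another usually changes only the constructor line.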

2. Rich Algorithm Library

python
# Can easily switch between different algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Support Vector Machine
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)

# K-Nearest Neighbors
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)

3. Complete Tool Chain

  • Data preprocessing
  • Model selection
  • Performance evaluation
  • Cross-validation
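These pieces compose cleanly. As one illustration (not the only way to combine them), a Pipeline chains preprocessing and a classifier so that cross-validation refits both on every fold:

```python
from sklearn.datasets import load_iris
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

iris = load_iris()

# Chain feature scaling and classification into one estimator
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=200))

# 5-fold cross-validation: each fold scales on its own training split
scores = cross_val_score(pipe, iris.data, iris.target, cv=5)
print(f"CV accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Cross-validation is covered in more depth later; the point here is that the pipeline itself behaves like any other estimator with fit() and predict().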

2.5 Common Machine Learning Workflow

python
# Standard machine learning workflow
def ml_workflow(X, y, model_class, **model_params):
    """
    Standard machine learning workflow
    """
    # 1. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )
    
    # 2. Create and train model
    model = model_class(**model_params)
    model.fit(X_train, y_train)
    
    # 3. Predict and evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    
    return model, accuracy

# Usage example
from sklearn.ensemble import RandomForestClassifier

model, acc = ml_workflow(X, y, RandomForestClassifier, n_estimators=100)
print(f"Random Forest Accuracy: {acc:.3f}")

2.6 Exercises

Exercise 1: Basic Operations

  1. Use different test_size values (0.1, 0.3, 0.4) and observe the impact on model performance
  2. Try not setting random_state and run the code multiple times to observe result changes
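As a starting point for part 1 of this exercise, the loop below is only a scaffold (interpreting the differences is the actual exercise):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = load_iris()

# Compare several train/test split ratios
for size in (0.1, 0.3, 0.4):
    X_train, X_test, y_train, y_test = train_test_split(
        iris.data, iris.target, test_size=size, random_state=42, stratify=iris.target
    )
    model = LogisticRegression(max_iter=200).fit(X_train, y_train)
    print(f"test_size={size}: accuracy={model.score(X_test, y_test):.3f}")
```

For part 2, simply drop the random_state argument from train_test_split and rerun the loop a few times.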

Exercise 2: Algorithm Comparison

Use the following algorithms to classify the Iris dataset and compare their performance:

  • Decision Tree (DecisionTreeClassifier)
  • Random Forest (RandomForestClassifier)
  • Support Vector Machine (SVC)

Exercise 3: Data Exploration

  1. Calculate the mean and standard deviation of each feature
  2. Find which two features have the strongest correlation
  3. Plot the distribution of each class in a two-dimensional feature space
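For parts 1 and 2 of this exercise, pandas does most of the work; a possible starting point:

```python
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Part 1: mean and standard deviation of each feature
print(df.agg(['mean', 'std']))

# Part 2: pairwise correlations; the largest off-diagonal entry
# identifies the most strongly correlated feature pair
corr = df.corr()
print(corr.round(3))
```

For part 3, a scatter plot of the two most correlated features colored by species (e.g. with plt.scatter or sns.scatterplot) works well.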

Exercise 4: Predict New Samples

Create 5 new Iris samples, use the trained model to make predictions, and analyze prediction confidence.

2.7 Summary

In this chapter, we learned:

  1. Basic Machine Learning Concepts: Features, labels, training set, test set
  2. Complete ML Workflow: Data loading → exploration → splitting → training → prediction → evaluation
  3. Scikit-learn Core API: fit(), predict(), score()
  4. Model Evaluation Methods: Accuracy, classification report, confusion matrix
  5. Data Visualization Techniques: Histograms, scatter plots, heatmaps

Key Points

  • Scikit-learn provides a unified, concise API
  • Machine learning projects follow a standard workflow
  • Data exploration and visualization are important first steps
  • Model evaluation is more than just looking at accuracy

2.8 Next Steps

Now you have experienced the complete machine learning workflow! In the next chapter, Data Preprocessing Basics, we will take a closer look at how to handle "dirty" real-world data, a key step in building successful machine learning models.


Chapter Key Points Review:

  • ✅ Mastered basic usage of Scikit-learn
  • ✅ Understood standard machine learning workflow
  • ✅ Learned basic model evaluation methods
  • ✅ Experienced the complete process from data to prediction

Content is for learning and research only.