Chapter 2: Quick Start Guide
Welcome to the Scikit-learn Quick Start Guide! This chapter walks you through a complete machine learning example, from loading data all the way to making model predictions.
2.1 Basic Machine Learning Concepts
Before writing any code, let's review a few core concepts (the short sketch after this list shows how they map onto arrays in code):
- Features: Input variables used for prediction, usually denoted as X
- Labels: Target variables we want to predict, usually denoted as y
- Training Set: Data used for training the model
- Test Set: Data used for evaluating model performance
- Model: Algorithm that learns data patterns
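To make these concepts concrete, here is a minimal, purely illustrative sketch of how they map onto arrays (the numbers are made up):

```python
# Purely illustrative: features X, labels y, and a train/test split as arrays
import numpy as np

X = np.array([[5.1, 3.5],
              [6.2, 2.8],
              [7.2, 3.0]])   # feature matrix: 3 samples, 2 features each
y = np.array([0, 1, 2])      # label vector: one target value per sample

X_train, y_train = X[:2], y[:2]  # training set: data the model learns from
X_test,  y_test  = X[2:], y[2:]  # test set: held out to evaluate the model
```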
2.2 First Machine Learning Project: Iris Classification
We will use the famous Iris dataset to build a classification model. The dataset contains 150 iris samples, each described by 4 features, and the task is to predict which of 3 species each sample belongs to.
Step 1: Import Necessary Libraries
```python
# Import basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Import Scikit-learn components
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Set a Chinese-capable font (optional; only needed for Chinese labels)
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False
```
Step 2: Load and Explore Data
```python
# Load the Iris dataset
iris = load_iris()

# View the dataset description
print("Dataset Description:")
print(iris.DESCR[:500] + "...")
print("\n" + "="*50)

# Get features and labels
X = iris.data    # Feature matrix
y = iris.target  # Label vector

# View data shapes
print(f"Feature Matrix Shape: {X.shape}")
print(f"Label Vector Shape: {y.shape}")

# View feature and class names
print(f"Feature Names: {iris.feature_names}")
print(f"Class Names: {iris.target_names}")
```
Step 3: Data Exploration and Visualization
```python
# Create a DataFrame for easier analysis
df = pd.DataFrame(X, columns=iris.feature_names)
df['species'] = iris.target_names[y]

# View the first few rows
print("First 5 Rows of Data:")
print(df.head())

# View summary statistics
print("\nData Statistics:")
print(df.describe())

# View class distribution
print("\nClass Distribution:")
print(df['species'].value_counts())
```
Step 4: Data Visualization
```python
# Create figure
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig.suptitle('Iris Dataset Feature Distribution', fontsize=16)

# Plot the distribution of each feature
features = iris.feature_names
for i, feature in enumerate(features):
    row = i // 2
    col = i % 2
    # Plot histogram by category
    for species in iris.target_names:
        data = df[df['species'] == species][feature]
        axes[row, col].hist(data, alpha=0.7, label=species, bins=15)
    axes[row, col].set_title(feature)
    axes[row, col].set_xlabel('Value')
    axes[row, col].set_ylabel('Frequency')
    axes[row, col].legend()
plt.tight_layout()
plt.show()

# Plot pairwise relationships between features
# (pairplot creates its own figure, so no plt.figure() call is needed)
g = sns.pairplot(df, hue='species', markers=["o", "s", "D"])
g.fig.suptitle('Feature Relationships', y=1.02)
plt.show()
```
Step 5: Data Splitting
```python
# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,    # 20% as test set
    random_state=42,  # Set random seed for reproducibility
    stratify=y        # Maintain class proportions
)

print(f"Training Set Size: {X_train.shape[0]}")
print(f"Test Set Size: {X_test.shape[0]}")

# View class distribution after splitting
print("\nTraining Set Class Distribution:")
unique, counts = np.unique(y_train, return_counts=True)
for label, count in zip(unique, counts):
    print(f"{iris.target_names[label]}: {count}")

print("\nTest Set Class Distribution:")
unique, counts = np.unique(y_test, return_counts=True)
for label, count in zip(unique, counts):
    print(f"{iris.target_names[label]}: {count}")
```
Step 6: Train Model
```python
# Create a logistic regression model
# (max_iter is raised from the default of 100 so the solver converges
#  on the raw, unscaled Iris features without a ConvergenceWarning)
model = LogisticRegression(max_iter=200, random_state=42)

# Train model
print("Training model...")
model.fit(X_train, y_train)
print("Model training complete!")

# View the learned parameters
print(f"\nModel Coefficient Shape: {model.coef_.shape}")
print(f"Model Intercept: {model.intercept_}")
```
Step 7: Model Prediction
```python
# Make predictions on the test set
y_pred = model.predict(X_test)

# Predicted class probabilities
y_pred_proba = model.predict_proba(X_test)

print("Prediction Results (First 10):")
for i in range(min(10, len(y_test))):
    true_label = iris.target_names[y_test[i]]
    pred_label = iris.target_names[y_pred[i]]
    confidence = np.max(y_pred_proba[i])
    print(f"True: {true_label:12} | Predicted: {pred_label:12} | Confidence: {confidence:.3f}")
```
Step 8: Model Evaluation
```python
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"\nModel Accuracy: {accuracy:.3f} ({accuracy*100:.1f}%)")

# Detailed classification report (per-class precision, recall, F1)
print("\nDetailed Classification Report:")
print(classification_report(y_test, y_pred, target_names=iris.target_names))

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(cm)

# Visualize the confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names,
            yticklabels=iris.target_names)
plt.title('Confusion Matrix')
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.show()
```
Step 9: Model Application
```python
# Create new samples for prediction
new_samples = np.array([
    [5.1, 3.5, 1.4, 0.2],  # Features similar to setosa
    [6.2, 2.8, 4.8, 1.8],  # Features similar to versicolor
    [7.2, 3.0, 5.8, 1.6]   # Features similar to virginica
])

# Make predictions
predictions = model.predict(new_samples)
probabilities = model.predict_proba(new_samples)

print("New Sample Prediction Results:")
for i, (sample, pred, prob) in enumerate(zip(new_samples, predictions, probabilities)):
    print(f"\nSample {i+1}: {sample}")
    print(f"Predicted Class: {iris.target_names[pred]}")
    print("Class Probabilities:")
    for j, class_name in enumerate(iris.target_names):
        print(f"  {class_name}: {prob[j]:.3f}")
```
2.3 Complete Code Example
Integrate all the above steps into a complete script:
```python
# iris_classification.py
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report


def main():
    # 1. Load data
    iris = load_iris()
    X, y = iris.data, iris.target

    # 2. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # 3. Train model
    model = LogisticRegression(max_iter=200, random_state=42)
    model.fit(X_train, y_train)

    # 4. Predict and evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model Accuracy: {accuracy:.3f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred, target_names=iris.target_names))

    return model, iris


if __name__ == "__main__":
    model, iris = main()
```
2.4 Core Advantages of Scikit-learn
Through this simple example, we can see several core advantages of Scikit-learn:
1. Unified API Design
All models follow the same interface:
- fit(X, y): Train model
- predict(X): Make predictions
- score(X, y): Evaluate performance
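As a small sketch of this uniform interface (reusing the X_train, X_test, y_train, y_test split from Step 5), two very different models are driven through exactly the same three calls:

```python
# Any scikit-learn estimator exposes the same fit/predict/score interface.
# This sketch reuses X_train, X_test, y_train, y_test from Step 5.
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

for estimator in (LogisticRegression(max_iter=200), KNeighborsClassifier()):
    estimator.fit(X_train, y_train)        # train
    preds = estimator.predict(X_test)      # predict
    acc = estimator.score(X_test, y_test)  # evaluate (mean accuracy)
    print(f"{type(estimator).__name__} accuracy: {acc:.3f}")
```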
2. Rich Algorithm Library
```python
# Easily switch between different algorithms
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

# Random Forest
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Support Vector Machine
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)

# K-Nearest Neighbors
knn_model = KNeighborsClassifier(n_neighbors=3)
knn_model.fit(X_train, y_train)
```
3. Complete Tool Chain
- Data preprocessing
- Model selection
- Performance evaluation
- Cross-validation
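These pieces compose naturally. As a minimal sketch (assuming the X and y arrays loaded earlier), a scaler and a model can be chained into a Pipeline and evaluated with cross-validation:

```python
# Minimal sketch: preprocessing + model in one Pipeline,
# evaluated with 5-fold cross-validation (assumes X, y from load_iris above)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

pipe = Pipeline([
    ('scaler', StandardScaler()),               # data preprocessing
    ('clf', LogisticRegression(max_iter=200))   # model
])
scores = cross_val_score(pipe, X, y, cv=5)      # cross-validation
print(f"CV accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```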
2.5 Common Machine Learning Workflow
```python
# Standard machine learning workflow
def ml_workflow(X, y, model_class, **model_params):
    """Run the standard workflow: split, train, evaluate."""
    # 1. Split data
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    # 2. Create and train model
    model = model_class(**model_params)
    model.fit(X_train, y_train)

    # 3. Predict and evaluate
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    return model, accuracy

# Usage example
from sklearn.ensemble import RandomForestClassifier
model, acc = ml_workflow(X, y, RandomForestClassifier, n_estimators=100)
print(f"Random Forest Accuracy: {acc:.3f}")
```
2.6 Exercises
Exercise 1: Basic Operations
- Use different test_size values (0.1, 0.3, 0.4) and observe the impact on model performance
- Try not setting random_state, run the code multiple times, and observe how the results change
Exercise 2: Algorithm Comparison
Use the following algorithms to classify the Iris dataset and compare their performance:
- Decision Tree (DecisionTreeClassifier)
- Random Forest (RandomForestClassifier)
- Support Vector Machine (SVC)
Exercise 3: Data Exploration
- Calculate the mean and standard deviation of each feature
- Find which two features have the strongest correlation
- Plot the distribution of each class in a two-dimensional feature space
Exercise 4: Predict New Samples
Create 5 new Iris samples, use the trained model to make predictions, and analyze prediction confidence.
2.7 Summary
In this chapter, we learned:
- Basic Machine Learning Concepts: Features, labels, training set, test set
- Complete ML Workflow: Data loading → exploration → splitting → training → prediction → evaluation
- Scikit-learn Core API: fit(), predict(), score()
- Model Evaluation Methods: Accuracy, classification report, confusion matrix
- Data Visualization Techniques: Histograms, scatter plots, heatmaps
Key Points
- Scikit-learn provides unified, concise API
- Machine learning projects follow a standard workflow
- Data exploration and visualization are important first steps
- Model evaluation is more than just looking at accuracy
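To see why the last point matters, consider a small illustrative sketch (toy data, made up for this example): with imbalanced classes, a model can score high accuracy while never detecting the minority class, which the classification report exposes:

```python
# Sketch: on imbalanced data, accuracy alone can be misleading.
# A "model" that always predicts the majority class scores 90% accuracy
# yet has zero recall on the minority class.
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

y_true = np.array([0]*90 + [1]*10)   # 90% majority class, 10% minority
y_pred = np.zeros(100, dtype=int)    # always predict the majority class

print(f"Accuracy: {accuracy_score(y_true, y_pred):.2f}")  # 0.90
print(classification_report(y_true, y_pred, zero_division=0))
```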
2.8 Next Steps
Now you have experienced the complete machine learning workflow! In the next chapter, Data Preprocessing Basics, we will dive deeper into handling "dirty" real-world data, a key step in building successful machine learning models.
Chapter Key Points Review:
- ✅ Mastered basic usage of Scikit-learn
- ✅ Understood standard machine learning workflow
- ✅ Learned basic model evaluation methods
- ✅ Experienced the complete process from data to prediction