Pandas Series Data Structure
Series is the fundamental one-dimensional data structure in Pandas; it can be thought of as a labeled array or an ordered dictionary. This chapter covers creating Series, operating on them, and applying them to real data analysis tasks.
📚 Series Overview
What is Series
Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, etc.). It consists of two main parts:
- Data (values): The actual stored data
- Index: Labels for the data
Characteristics of Series
- ✅ One-dimensional structure: Similar to arrays or lists
- ✅ Labeled index: Each element has a corresponding label
- ✅ Homogeneous data: All elements have the same data type
- ✅ Fixed size: Length is fixed after creation
- ✅ Mutable data: Element values can be modified
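These characteristics can be seen together in a minimal sketch: a Series supports both dict-style label access and array-style positional access, its values can be changed in place, and all elements share one dtype.

```python
import pandas as pd

# A Series behaves like a NumPy array and a dict at the same time
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])       # dict-style label access
print(s.iloc[1])    # array-style positional access
s['b'] = 25         # element values are mutable
print(len(s))       # length is still 3
print(s.dtype)      # one shared dtype: int64
```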
🔨 Creating Series
Creating from a List
import pandas as pd
import numpy as np
# Create Series from a list
data = [10, 20, 30, 40, 50]
s1 = pd.Series(data)
print(s1)
# Output:
# 0 10
# 1 20
# 2 30
# 3 40
# 4 50
# dtype: int64
# Specify index
s2 = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print(s2)
# Output:
# a 10
# b 20
# c 30
# d 40
# e 50
# dtype: int64
Creating from a Dictionary
# Create Series from a dictionary
data_dict = {
    'Beijing': 2154,
    'Shanghai': 2424,
    'Guangzhou': 1491,
    'Shenzhen': 1344,
    'Hangzhou': 1036
}
population = pd.Series(data_dict)
print(population)
# Output:
# Beijing 2154
# Shanghai 2424
# Guangzhou 1491
# Shenzhen 1344
# Hangzhou 1036
# dtype: int64
Creating from NumPy Array
# Create from NumPy array
arr = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
s3 = pd.Series(arr, index=['A', 'B', 'C', 'D', 'E'])
print(s3)
# Output:
# A 1.1
# B 2.2
# C 3.3
# D 4.4
# E 5.5
# dtype: float64
Creating from Scalar Value
# Create from scalar value (requires index specification)
s4 = pd.Series(100, index=['x', 'y', 'z'])
print(s4)
# Output:
# x 100
# y 100
# z 100
# dtype: int64
Creating Special Series
# Create empty Series
empty_series = pd.Series(dtype=float)
print(f"Empty Series: {empty_series}")
# Create date sequence
date_range = pd.date_range('2024-01-01', periods=5, freq='D')
date_series = pd.Series(range(1, 6), index=date_range)
print(date_series)
# Create categorical data
categories = pd.Categorical(['A', 'B', 'A', 'C', 'B'])
cat_series = pd.Series(categories)
print(cat_series)
🔍 Series Attributes
Basic Attributes
# Create example Series
scores = pd.Series([85, 92, 78, 96, 88],
                   index=['Alice', 'Bob', 'Charlie', 'David', 'Eve'])
# View basic attributes
print(f"Data type: {scores.dtype}") # int64
print(f"Shape: {scores.shape}") # (5,)
print(f"Size: {scores.size}") # 5
print(f"Dimensions: {scores.ndim}") # 1
print(f"Index: {scores.index}") # Index(['Alice', 'Bob', 'Charlie', 'David', 'Eve'])
print(f"Values: {scores.values}") # [85 92 78 96 88]
print(f"Name: {scores.name}") # None
Setting Names
# Set Series and index names
scores.name = 'Exam Scores'
scores.index.name = 'Student Name'
print(scores)
# Output:
# Student Name
# Alice 85
# Bob 92
# Charlie 78
# David 96
# Eve 88
# Name: Exam Scores, dtype: int64
Memory Usage
# Check memory usage
print(f"Memory usage: {scores.memory_usage()} bytes")
print(f"Memory usage (deep): {scores.memory_usage(deep=True)} bytes")
🎯 Indexing and Selection
Position-based Indexing
# Create example data
fruits = pd.Series(['Apple', 'Banana', 'Orange', 'Grape', 'Strawberry'],
                   index=['A', 'B', 'C', 'D', 'E'])
# Position-based indexing (starting from 0); plain fruits[0] on a
# label-indexed Series is deprecated since pandas 2.1, so iloc is used here
print(fruits.iloc[0]) # Apple
print(fruits.iloc[2]) # Orange
print(fruits.iloc[-1]) # Strawberry
# Slicing by position
print(fruits.iloc[1:4]) # B to D (excluding E)
print(fruits.iloc[:3]) # First 3
print(fruits.iloc[2:]) # From the 3rd element
Label-based Indexing
# Label-based indexing
print(fruits['A']) # Apple
print(fruits['C']) # Orange
# Multiple labels
print(fruits[['A', 'C', 'E']])
# Output:
# A Apple
# C Orange
# E Strawberry
# dtype: object
Boolean Indexing
# Create numeric Series
temperatures = pd.Series([22, 25, 19, 30, 27, 24],
                         index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])
# Boolean indexing
hot_days = temperatures > 25
print(hot_days)
# Output:
# Monday False
# Tuesday False
# Wednesday False
# Thursday True
# Friday True
# Saturday False
# dtype: bool
# Filter hot days
print(temperatures[hot_days])
# Output:
# Thursday 30
# Friday 27
# dtype: int64
# Compound conditions
comfortable = temperatures[(temperatures >= 20) & (temperatures <= 25)]
print(comfortable)
Advanced Indexing Methods
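Alongside the accessors covered in this section, `Series.get` offers dict-style lookup that returns a default value instead of raising a `KeyError` for a missing label:

```python
import pandas as pd

fruits = pd.Series(['Apple', 'Banana', 'Orange'], index=['A', 'B', 'C'])
# get() mirrors dict.get: missing labels yield the default, not an error
print(fruits.get('B'))             # Banana
print(fruits.get('Z', 'Unknown'))  # Unknown
```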
# iloc: Position-based indexing
print(fruits.iloc[0]) # First element
print(fruits.iloc[1:3]) # 2nd to 3rd elements
# loc: Label-based indexing
print(fruits.loc['A']) # Element with label 'A'
print(fruits.loc['B':'D']) # Labels from 'B' to 'D'
# at and iat: Fast access to single elements
print(fruits.at['A']) # Same as fruits['A']
print(fruits.iat[0]) # Same as fruits.iloc[0]
🔧 Series Operations
Mathematical Operations
# Create numeric Series
prices = pd.Series([10.5, 20.3, 15.8, 25.2, 18.7],
                   index=['Product A', 'Product B', 'Product C', 'Product D', 'Product E'])
# Scalar operations
print("Original prices:")
print(prices)
print("\n10% discount:")
print(prices * 0.9)
print("\nAfter tax (+10%):")
print(prices * 1.1)
print("\nAdd $5 to each:")
print(prices + 5)
Operations Between Series
# Create two Series
q1_sales = pd.Series([100, 150, 200, 120],
                     index=['Product A', 'Product B', 'Product C', 'Product D'])
q2_sales = pd.Series([120, 180, 190, 140],
                     index=['Product A', 'Product B', 'Product C', 'Product D'])
# Addition
total_sales = q1_sales + q2_sales
print("Total sales:")
print(total_sales)
# Growth rate
growth_rate = (q2_sales - q1_sales) / q1_sales * 100
print("\nGrowth rate (%):")
print(growth_rate)
Operations with Different Indexes
# Operations between Series with different indexes
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6, 7], index=['a', 'b', 'd', 'e'])
result = s1 + s2
print(result)
# Output:
# a 5.0
# b 7.0
# c NaN
# d NaN
# e NaN
# dtype: float64
# Use fill_value to handle missing values
result_filled = s1.add(s2, fill_value=0)
print(result_filled)
📊 Statistical Methods
Descriptive Statistics
# Create score data
student_scores = pd.Series([85, 92, 78, 96, 88, 91, 83, 89, 94, 87])
# Basic statistics
print(f"Mean: {student_scores.mean():.2f}")
print(f"Median: {student_scores.median():.2f}")
print(f"Standard deviation: {student_scores.std():.2f}")
print(f"Variance: {student_scores.var():.2f}")
print(f"Minimum: {student_scores.min()}")
print(f"Maximum: {student_scores.max()}")
print(f"Sum: {student_scores.sum()}")
print(f"Count: {student_scores.count()}")
# Quantiles
print(f"25th percentile: {student_scores.quantile(0.25)}")
print(f"75th percentile: {student_scores.quantile(0.75)}")
# Complete description
print("\nComplete statistical description:")
print(student_scores.describe())
Sorting and Ranking
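When only the top or bottom few values are needed, `nlargest` and `nsmallest` are handy shortcuts that avoid sorting the whole Series:

```python
import pandas as pd

student_scores = pd.Series([85, 92, 78, 96, 88, 91, 83, 89, 94, 87])
top3 = student_scores.nlargest(3)      # three highest scores
bottom2 = student_scores.nsmallest(2)  # two lowest scores
print(top3)     # 96, 94, 92 with their original positions as index
print(bottom2)  # 78, 83
```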
# Sort by values
sorted_scores = student_scores.sort_values(ascending=False)
print("Scores from highest to lowest:")
print(sorted_scores)
# Sort by index
sorted_by_index = student_scores.sort_index()
print("\nSorted by index:")
print(sorted_by_index)
# Ranking
ranks = student_scores.rank(ascending=False)
print("\nScore rankings:")
print(ranks)
Unique Values and Counts
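Membership tests pair naturally with uniqueness checks: `isin` builds a boolean mask from a list of allowed values, which can then filter the Series.

```python
import pandas as pd

grades = pd.Series(['A', 'B', 'A', 'C', 'B', 'A', 'D'])
# Keep only the entries whose value appears in the given list
passing = grades[grades.isin(['A', 'B'])]
print(passing)
print(len(passing))  # 5
```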
# Create Series with duplicate values
grades = pd.Series(['A', 'B', 'A', 'C', 'B', 'A', 'D', 'C', 'B'])
# Unique values
print(f"Unique values: {grades.unique()}")
print(f"Number of unique values: {grades.nunique()}")
# Value counts
print("\nCount by grade:")
print(grades.value_counts())
# Counts sorted by index
print("\nCounts sorted by grade:")
print(grades.value_counts().sort_index())
🔄 Data Processing
Handling Missing Values
# Create Series with missing values
data_with_nan = pd.Series([1, 2, np.nan, 4, 5, np.nan, 7])
print("Original data:")
print(data_with_nan)
# Detect missing values
print(f"\nMissing value detection: {data_with_nan.isnull()}")
print(f"Non-null detection: {data_with_nan.notnull()}")
print(f"Count of missing values: {data_with_nan.isnull().sum()}")
# Drop missing values
print("\nAfter dropping missing values:")
print(data_with_nan.dropna())
# Fill missing values
print("\nFill with 0:")
print(data_with_nan.fillna(0))
print("\nFill with mean:")
print(data_with_nan.fillna(data_with_nan.mean()))
print("\nForward fill:")
print(data_with_nan.ffill())  # fillna(method='ffill') is deprecated
print("\nBackward fill:")
print(data_with_nan.bfill())
Data Transformation
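Beyond dropping and constant filling, numeric gaps can also be estimated from their neighbors with `interpolate` (linear by default):

```python
import pandas as pd
import numpy as np

data_with_nan = pd.Series([1.0, 2.0, np.nan, 4.0])
# Linear interpolation: the NaN between 2.0 and 4.0 becomes 3.0
filled = data_with_nan.interpolate()
print(filled)
```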
# Create string Series
names = pd.Series(['Alice', 'Bob', 'Charlie', 'David'])
# String methods
print("Original names:")
print(names)
print("\nUppercase:")
print(names.str.upper())
print("\nAdd suffix:")
print(names + ' (Student)')
print("\nString length:")
print(names.str.len())
# Numeric conversion
number_strings = pd.Series(['1', '2', '3', '4', '5'])
print("\nString to numeric:")
print(pd.to_numeric(number_strings))
# Type conversion
float_series = pd.Series([1.1, 2.2, 3.3, 4.4])
print("\nFloat to integer:")
print(float_series.astype(int))
Applying Functions
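A close cousin of `apply` is `map`, which substitutes each element via a dict (or function) lookup; values not found in the dict become NaN.

```python
import pandas as pd

grades = pd.Series(['A', 'B', 'A', 'C'])
# Element-wise dict lookup: letter grades -> grade points
points = grades.map({'A': 4.0, 'B': 3.0, 'C': 2.0})
print(points)
```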
# Create numeric Series
numbers = pd.Series([1, 4, 9, 16, 25])
# Apply built-in function
print("Square root:")
print(numbers.apply(np.sqrt))
# Apply custom function
def classify_number(x):
    if x < 10:
        return 'Small'
    elif x < 20:
        return 'Medium'
    else:
        return 'Large'
print("\nNumber classification:")
print(numbers.apply(classify_number))
# Use lambda function
print("\nSquare:")
print(numbers.apply(lambda x: x ** 2))
🔗 Series Merging and Concatenation
Concatenating Series
# Create multiple Series
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['d', 'e', 'f'])
s3 = pd.Series([7, 8, 9], index=['g', 'h', 'i'])
# Concatenate Series
concatenated = pd.concat([s1, s2, s3])
print("Concatenated Series:")
print(concatenated)
# Reset index
reset_index = pd.concat([s1, s2, s3], ignore_index=True)
print("\nAfter resetting index:")
print(reset_index)
Appending Elements
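Whether concatenating whole Series or appending single elements, `pd.concat` can also label each input with `keys=`, producing a hierarchical (MultiIndex) result from which each source can be recovered:

```python
import pandas as pd

s1 = pd.Series([1, 2], index=['a', 'b'])
s2 = pd.Series([3, 4], index=['a', 'b'])
# keys= tags each piece, giving a two-level index
combined = pd.concat([s1, s2], keys=['first', 'second'])
print(combined)            # (first, a), (first, b), (second, a), (second, b)
print(combined['second'])  # select one source by its key
```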
# Append single element (deprecated, use concat)
original = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
new_element = pd.Series([4], index=['d'])
appended = pd.concat([original, new_element])
print("After appending element:")
print(appended)
🎨 Practical Application Examples
Example 1: Stock Price Analysis
# Simulate stock price data
stock_prices = pd.Series([
    100.5, 102.3, 98.7, 105.2, 107.8, 103.4, 109.1, 106.5, 111.2, 108.9
], index=pd.date_range('2024-01-01', periods=10, freq='D'))
stock_prices.name = 'Stock Price'
stock_prices.index.name = 'Date'
print("Stock price data:")
print(stock_prices)
# Calculate daily returns
daily_returns = stock_prices.pct_change() * 100
print(f"\nAverage daily return: {daily_returns.mean():.2f}%")
print(f"Return standard deviation: {daily_returns.std():.2f}%")
# Find dates with maximum gain and loss
max_gain_date = daily_returns.idxmax()
max_loss_date = daily_returns.idxmin()
print(f"\nMax gain date: {max_gain_date}, Gain: {daily_returns[max_gain_date]:.2f}%")
print(f"Max loss date: {max_loss_date}, Loss: {daily_returns[max_loss_date]:.2f}%")
Example 2: Sales Data Analysis
# Monthly sales data
monthly_sales = pd.Series([
    120000, 135000, 142000, 158000, 163000, 171000,
    185000, 192000, 178000, 165000, 155000, 148000
], index=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
monthly_sales.name = 'Monthly Sales'
print("Monthly sales data:")
print(monthly_sales)
# Sales statistics
print(f"\nAnnual total sales: {monthly_sales.sum():,}")
print(f"Average monthly sales: {monthly_sales.mean():,.0f}")
print(f"Median sales: {monthly_sales.median():,.0f}")
# Find best and worst months
best_month = monthly_sales.idxmax()
worst_month = monthly_sales.idxmin()
print(f"\nBest month: {best_month} ({monthly_sales[best_month]:,})")
print(f"Worst month: {worst_month} ({monthly_sales[worst_month]:,})")
# Calculate month-over-month growth rate
month_over_month = monthly_sales.pct_change() * 100
print("\nMonth-over-month growth rate:")
print(month_over_month.dropna().round(2))
Example 3: Student Grade Analysis
# Student grade data
student_grades = pd.Series({
    'Alice': 85, 'Bob': 92, 'Charlie': 78, 'David': 96, 'Eve': 88,
    'Frank': 91, 'Grace': 83, 'Henry': 89, 'Ivy': 94, 'Jack': 87
})
student_grades.name = 'Final Grade'
student_grades.index.name = 'Student Name'
print("Student grades:")
print(student_grades)
# Grade level classification
def grade_level(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    elif score >= 60:
        return 'D'
    else:
        return 'F'
grade_levels = student_grades.apply(grade_level)
print("\nGrade levels:")
print(grade_levels)
# Grade statistics
print("\nGrade distribution:")
print(grade_levels.value_counts().sort_index())
# Excellent students (90 or above)
excellent_students = student_grades[student_grades >= 90]
print(f"\nExcellent students (>=90): {len(excellent_students)} students")
print(excellent_students.sort_values(ascending=False))
🔍 Converting Series to Other Data Structures
Converting to Other Types
# Create example Series
data = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
# Convert to list
print(f"Convert to list: {data.tolist()}")
# Convert to NumPy array (to_numpy() is preferred over .values)
print(f"Convert to array: {data.to_numpy()}")
# Convert to dictionary
print(f"Convert to dictionary: {data.to_dict()}")
# Convert to DataFrame
df = data.to_frame(name='Value')
print("\nConvert to DataFrame:")
print(df)
📈 Performance Optimization Tips
Vectorized Operations
import time
# Create large Series
large_series = pd.Series(np.random.randn(1000000))
# Compare performance of loops vs vectorized operations
# Method 1: Loop (slow)
start_time = time.time()
result1 = pd.Series([x**2 if x > 0 else 0 for x in large_series])
loop_time = time.time() - start_time
# Method 2: Vectorized (fast)
start_time = time.time()
result2 = large_series.where(large_series > 0, 0) ** 2
vectorized_time = time.time() - start_time
print(f"Loop method time: {loop_time:.4f} seconds")
print(f"Vectorized method time: {vectorized_time:.4f} seconds")
print(f"Performance improvement: {loop_time/vectorized_time:.1f}x")
Memory Optimization
# Choose appropriate data types
# Default integer type
default_int = pd.Series([1, 2, 3, 4, 5])
print(f"Default int type: {default_int.dtype}, Memory: {default_int.memory_usage()} bytes")
# Optimized integer type
optimized_int = pd.Series([1, 2, 3, 4, 5], dtype='int8')
print(f"Optimized int type: {optimized_int.dtype}, Memory: {optimized_int.memory_usage()} bytes")
# Categorical data optimization
colors = pd.Series(['Red', 'Green', 'Blue'] * 1000)
print(f"String type memory: {colors.memory_usage(deep=True)} bytes")
colors_cat = colors.astype('category')
print(f"Category type memory: {colors_cat.memory_usage(deep=True)} bytes")
print(f"Memory savings: {(1 - colors_cat.memory_usage(deep=True)/colors.memory_usage(deep=True))*100:.1f}%")
📝 Chapter Summary
Through this chapter, you should have mastered:
✅ Series Basic Concepts: Understanding the structure and characteristics of Series
✅ Creating Series: Mastering various methods for creating Series
✅ Indexing and Selection: Proficiently using various indexing methods
✅ Data Operations: Performing mathematical operations and data processing
✅ Statistical Analysis: Using statistical methods to analyze data
✅ Practical Applications: Solving real data analysis problems
✅ Performance Optimization: Improving code execution efficiency
Key Points
- Series is the Foundation of Pandas: Understanding Series is crucial for learning DataFrame
- Importance of Indexing: Proper use of indexing can greatly improve data processing efficiency
- Vectorized Operations: Avoid loops, use Pandas built-in methods
- Data Type Optimization: Choosing appropriate data types can save memory
Next Steps
Now that you've mastered Series, next we'll learn about Pandas' other core data structure: DataFrame.
Next Chapter: Pandas DataFrame Data Structure