Pandas Series Data Structure

Series is the most basic one-dimensional data structure in Pandas; it can be understood as a labeled array or an ordered dictionary. This chapter comprehensively introduces the creation, operations, and applications of Series.

📚 Series Overview

What is Series

Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, etc.). It consists of two main parts:

  • Data (values): The actual stored data
  • Index: Labels for the data

Characteristics of Series

  • One-dimensional structure: Similar to arrays or lists
  • Labeled index: Each element has a corresponding label
  • Single data type: Each Series has one dtype; mixed contents fall back to the generic `object` dtype
  • Size-immutable: The length of a Series cannot be changed after creation
  • Value-mutable: Element values can be modified
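A minimal sketch of these characteristics (labeled access, value mutability, and the single-dtype rule):

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])

# Labeled index: each element is reachable by its label
print(s['b'])        # 20

# Value-mutable: element values can be changed in place
s['b'] = 25
print(s['b'])        # 25

# Single dtype: mixing types falls back to the generic 'object' dtype
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)   # object
```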

🔨 Creating Series

Creating from a List

python
import pandas as pd
import numpy as np

# Create Series from a list
data = [10, 20, 30, 40, 50]
s1 = pd.Series(data)
print(s1)
# Output:
# 0    10
# 1    20
# 2    30
# 3    40
# 4    50
# dtype: int64

# Specify index
s2 = pd.Series(data, index=['a', 'b', 'c', 'd', 'e'])
print(s2)
# Output:
# a    10
# b    20
# c    30
# d    40
# e    50
# dtype: int64

Creating from a Dictionary

python
# Create Series from a dictionary
data_dict = {
    'Beijing': 2154,
    'Shanghai': 2424,
    'Guangzhou': 1491,
    'Shenzhen': 1344,
    'Hangzhou': 1036
}

population = pd.Series(data_dict)
print(population)
# Output:
# Beijing      2154
# Shanghai     2424
# Guangzhou    1491
# Shenzhen     1344
# Hangzhou     1036
# dtype: int64

Creating from NumPy Array

python
# Create from NumPy array
arr = np.array([1.1, 2.2, 3.3, 4.4, 5.5])
s3 = pd.Series(arr, index=['A', 'B', 'C', 'D', 'E'])
print(s3)
# Output:
# A    1.1
# B    2.2
# C    3.3
# D    4.4
# E    5.5
# dtype: float64

Creating from Scalar Value

python
# Create from scalar value (requires index specification)
s4 = pd.Series(100, index=['x', 'y', 'z'])
print(s4)
# Output:
# x    100
# y    100
# z    100
# dtype: int64

Creating Special Series

python
# Create empty Series
empty_series = pd.Series(dtype=float)
print(f"Empty Series: {empty_series}")

# Create date sequence
date_range = pd.date_range('2024-01-01', periods=5, freq='D')
date_series = pd.Series(range(1, 6), index=date_range)
print(date_series)

# Create categorical data
categories = pd.Categorical(['A', 'B', 'A', 'C', 'B'])
cat_series = pd.Series(categories)
print(cat_series)

🔍 Series Attributes

Basic Attributes

python
# Create example Series
scores = pd.Series([85, 92, 78, 96, 88], 
                  index=['Alice', 'Bob', 'Charlie', 'David', 'Eve'])

# View basic attributes
print(f"Data type: {scores.dtype}")           # int64
print(f"Shape: {scores.shape}")               # (5,)
print(f"Size: {scores.size}")                 # 5
print(f"Dimensions: {scores.ndim}")           # 1
print(f"Index: {scores.index}")               # Index(['Alice', 'Bob', 'Charlie', 'David', 'Eve'])
print(f"Values: {scores.values}")             # [85 92 78 96 88]
print(f"Name: {scores.name}")                 # None

Setting Names

python
# Set Series and index names
scores.name = 'Exam Scores'
scores.index.name = 'Student Name'
print(scores)
# Output:
# Student Name
# Alice      85
# Bob        92
# Charlie    78
# David      96
# Eve        88
# Name: Exam Scores, dtype: int64

Memory Usage

python
# Check memory usage
print(f"Memory usage: {scores.memory_usage()} bytes")
print(f"Memory usage (deep): {scores.memory_usage(deep=True)} bytes")

🎯 Indexing and Selection

Position-based Indexing

python
# Create example data
fruits = pd.Series(['Apple', 'Banana', 'Orange', 'Grape', 'Strawberry'], 
                  index=['A', 'B', 'C', 'D', 'E'])

# Position-based indexing via .iloc (starting from 0)
# Note: fruits[0] on a labeled Series is deprecated since pandas 2.x
print(fruits.iloc[0])       # Apple
print(fruits.iloc[2])       # Orange
print(fruits.iloc[-1])      # Strawberry

# Slicing by position
print(fruits.iloc[1:4])     # Positions 1 to 3 (B to D)
print(fruits.iloc[:3])      # First 3
print(fruits.iloc[2:])      # From the 3rd element

Label-based Indexing

python
# Label-based indexing
print(fruits['A'])      # Apple
print(fruits['C'])      # Orange

# Multiple labels
print(fruits[['A', 'C', 'E']])
# Output:
# A        Apple
# C       Orange
# E    Strawberry
# dtype: object

Boolean Indexing

python
# Create numeric Series
temperatures = pd.Series([22, 25, 19, 30, 27, 24], 
                        index=['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'])

# Boolean indexing
hot_days = temperatures > 25
print(hot_days)
# Output:
# Monday       False
# Tuesday      False
# Wednesday    False
# Thursday      True
# Friday        True
# Saturday     False
# dtype: bool

# Filter hot days
print(temperatures[hot_days])
# Output:
# Thursday    30
# Friday      27
# dtype: int64

# Compound conditions
comfortable = temperatures[(temperatures >= 20) & (temperatures <= 25)]
print(comfortable)

Advanced Indexing Methods

python
# iloc: Position-based indexing
print(fruits.iloc[0])       # First element
print(fruits.iloc[1:3])     # 2nd to 3rd elements

# loc: Label-based indexing
print(fruits.loc['A'])      # Element with label 'A'
print(fruits.loc['B':'D'])  # Labels from 'B' to 'D'

# at and iat: Fast access to single elements
print(fruits.at['A'])       # Same as fruits['A']
print(fruits.iat[0])        # Same as fruits.iloc[0]

🔧 Series Operations

Mathematical Operations

python
# Create numeric Series
prices = pd.Series([10.5, 20.3, 15.8, 25.2, 18.7], 
                  index=['Product A', 'Product B', 'Product C', 'Product D', 'Product E'])

# Scalar operations
print("Original prices:")
print(prices)

print("\n10% discount:")
print(prices * 0.9)

print("\nAfter tax (+10%):")
print(prices * 1.1)

print("\nAdd $5 to each:")
print(prices + 5)

Operations Between Series

python
# Create two Series
q1_sales = pd.Series([100, 150, 200, 120], 
                    index=['Product A', 'Product B', 'Product C', 'Product D'])
q2_sales = pd.Series([120, 180, 190, 140], 
                    index=['Product A', 'Product B', 'Product C', 'Product D'])

# Addition
total_sales = q1_sales + q2_sales
print("Total sales:")
print(total_sales)

# Growth rate
growth_rate = (q2_sales - q1_sales) / q1_sales * 100
print("\nGrowth rate (%):")
print(growth_rate)

Operations with Different Indexes

python
# Operations between Series with different indexes
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6, 7], index=['a', 'b', 'd', 'e'])

result = s1 + s2
print(result)
# Output:
# a    5.0
# b    7.0
# c    NaN
# d    NaN
# e    NaN
# dtype: float64

# Use fill_value to handle missing values
result_filled = s1.add(s2, fill_value=0)
print(result_filled)

📊 Statistical Methods

Descriptive Statistics

python
# Create score data
student_scores = pd.Series([85, 92, 78, 96, 88, 91, 83, 89, 94, 87])

# Basic statistics
print(f"Mean: {student_scores.mean():.2f}")
print(f"Median: {student_scores.median():.2f}")
print(f"Standard deviation: {student_scores.std():.2f}")
print(f"Variance: {student_scores.var():.2f}")
print(f"Minimum: {student_scores.min()}")
print(f"Maximum: {student_scores.max()}")
print(f"Sum: {student_scores.sum()}")
print(f"Count: {student_scores.count()}")

# Quantiles
print(f"25th percentile: {student_scores.quantile(0.25)}")
print(f"75th percentile: {student_scores.quantile(0.75)}")

# Complete description
print("\nComplete statistical description:")
print(student_scores.describe())

Sorting and Ranking

python
# Sort by values
sorted_scores = student_scores.sort_values(ascending=False)
print("Scores from highest to lowest:")
print(sorted_scores)

# Sort by index
sorted_by_index = student_scores.sort_index()
print("\nSorted by index:")
print(sorted_by_index)

# Ranking
ranks = student_scores.rank(ascending=False)
print("\nScore rankings:")
print(ranks)

Unique Values and Counts

python
# Create Series with duplicate values
grades = pd.Series(['A', 'B', 'A', 'C', 'B', 'A', 'D', 'C', 'B'])

# Unique values
print(f"Unique values: {grades.unique()}")
print(f"Number of unique values: {grades.nunique()}")

# Value counts
print("\nCount by grade:")
print(grades.value_counts())

# Counts sorted by index
print("\nCounts sorted by grade:")
print(grades.value_counts().sort_index())

🔄 Data Processing

Handling Missing Values

python
# Create Series with missing values
data_with_nan = pd.Series([1, 2, np.nan, 4, 5, np.nan, 7])
print("Original data:")
print(data_with_nan)

# Detect missing values
print(f"\nMissing value detection: {data_with_nan.isnull()}")
print(f"Non-null detection: {data_with_nan.notnull()}")
print(f"Count of missing values: {data_with_nan.isnull().sum()}")

# Drop missing values
print("\nAfter dropping missing values:")
print(data_with_nan.dropna())

# Fill missing values
print("\nFill with 0:")
print(data_with_nan.fillna(0))

print("\nFill with mean:")
print(data_with_nan.fillna(data_with_nan.mean()))

print("\nForward fill:")
print(data_with_nan.ffill())   # fillna(method='ffill') is deprecated

print("\nBackward fill:")
print(data_with_nan.bfill())   # fillna(method='bfill') is deprecated

Data Transformation

python
# Create string Series
names = pd.Series(['Alice', 'Bob', 'Charlie', 'David'])

# String methods
print("Original names:")
print(names)

print("\nUppercase:")
print(names.str.upper())

print("\nAdd suffix:")
print(names + ' (Student)')

print("\nString length:")
print(names.str.len())

# Numeric conversion
number_strings = pd.Series(['1', '2', '3', '4', '5'])
print("\nString to numeric:")
print(pd.to_numeric(number_strings))

# Type conversion
float_series = pd.Series([1.1, 2.2, 3.3, 4.4])
print("\nFloat to integer:")
print(float_series.astype(int))

Applying Functions

python
# Create numeric Series
numbers = pd.Series([1, 4, 9, 16, 25])

# Apply built-in function
print("Square root:")
print(numbers.apply(np.sqrt))

# Apply custom function
def classify_number(x):
    if x < 10:
        return 'Small'
    elif x < 20:
        return 'Medium'
    else:
        return 'Large'

print("\nNumber classification:")
print(numbers.apply(classify_number))

# Use lambda function
print("\nSquare:")
print(numbers.apply(lambda x: x ** 2))

🔗 Series Merging and Concatenation

Concatenating Series

python
# Create multiple Series
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['d', 'e', 'f'])
s3 = pd.Series([7, 8, 9], index=['g', 'h', 'i'])

# Concatenate Series
concatenated = pd.concat([s1, s2, s3])
print("Concatenated Series:")
print(concatenated)

# Reset index
reset_index = pd.concat([s1, s2, s3], ignore_index=True)
print("\nAfter resetting index:")
print(reset_index)
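`pd.concat` can also tag each input with a key, producing a hierarchical (MultiIndex) result; a minimal sketch with the same three Series:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([4, 5, 6], index=['d', 'e', 'f'])
s3 = pd.Series([7, 8, 9], index=['g', 'h', 'i'])

# keys= labels each input, yielding a two-level MultiIndex
tagged = pd.concat([s1, s2, s3], keys=['first', 'second', 'third'])
print(tagged)

# Select one group by its outer key
print(tagged['second'])
```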

Appending Elements

python
# Append a single element (Series.append was removed in pandas 2.0; use pd.concat)
original = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
new_element = pd.Series([4], index=['d'])
appended = pd.concat([original, new_element])
print("After appending element:")
print(appended)

🎨 Practical Application Examples

Example 1: Stock Price Analysis

python
# Simulate stock price data
stock_prices = pd.Series([
    100.5, 102.3, 98.7, 105.2, 107.8, 103.4, 109.1, 106.5, 111.2, 108.9
], index=pd.date_range('2024-01-01', periods=10, freq='D'))

stock_prices.name = 'Stock Price'
stock_prices.index.name = 'Date'

print("Stock price data:")
print(stock_prices)

# Calculate daily returns
daily_returns = stock_prices.pct_change() * 100
print(f"\nAverage daily return: {daily_returns.mean():.2f}%")
print(f"Return standard deviation: {daily_returns.std():.2f}%")

# Find dates with maximum gain and loss
max_gain_date = daily_returns.idxmax()
max_loss_date = daily_returns.idxmin()
print(f"\nMax gain date: {max_gain_date}, Gain: {daily_returns[max_gain_date]:.2f}%")
print(f"Max loss date: {max_loss_date}, Loss: {daily_returns[max_loss_date]:.2f}%")

Example 2: Sales Data Analysis

python
# Monthly sales data
monthly_sales = pd.Series([
    120000, 135000, 142000, 158000, 163000, 171000,
    185000, 192000, 178000, 165000, 155000, 148000
], index=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
         'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

monthly_sales.name = 'Monthly Sales'

print("Monthly sales data:")
print(monthly_sales)

# Sales statistics
print(f"\nAnnual total sales: {monthly_sales.sum():,}")
print(f"Average monthly sales: {monthly_sales.mean():,.0f}")
print(f"Median sales: {monthly_sales.median():,.0f}")

# Find best and worst months
best_month = monthly_sales.idxmax()
worst_month = monthly_sales.idxmin()
print(f"\nBest month: {best_month} ({monthly_sales[best_month]:,})")
print(f"Worst month: {worst_month} ({monthly_sales[worst_month]:,})")

# Calculate month-over-month growth rate
month_over_month = monthly_sales.pct_change() * 100
print("\nMonth-over-month growth rate:")
print(month_over_month.dropna().round(2))

Example 3: Student Grade Analysis

python
# Student grade data
student_grades = pd.Series({
    'Alice': 85, 'Bob': 92, 'Charlie': 78, 'David': 96, 'Eve': 88,
    'Frank': 91, 'Grace': 83, 'Henry': 89, 'Ivy': 94, 'Jack': 87
})

student_grades.name = 'Final Grade'
student_grades.index.name = 'Student Name'

print("Student grades:")
print(student_grades)

# Grade level classification
def grade_level(score):
    if score >= 90:
        return 'A'
    elif score >= 80:
        return 'B'
    elif score >= 70:
        return 'C'
    elif score >= 60:
        return 'D'
    else:
        return 'F'

grade_levels = student_grades.apply(grade_level)
print("\nGrade levels:")
print(grade_levels)

# Grade statistics
print("\nGrade distribution:")
print(grade_levels.value_counts().sort_index())

# Excellent students (90 or above)
excellent_students = student_grades[student_grades >= 90]
print(f"\nExcellent students (>=90): {len(excellent_students)} students")
print(excellent_students.sort_values(ascending=False))

🔍 Converting Series to Other Data Structures

Converting to Other Types

python
# Create example Series
data = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

# Convert to list
print(f"Convert to list: {data.tolist()}")

# Convert to NumPy array (.to_numpy() is preferred over .values)
print(f"Convert to array: {data.to_numpy()}")

# Convert to dictionary
print(f"Convert to dictionary: {data.to_dict()}")

# Convert to DataFrame
df = data.to_frame(name='Value')
print("\nConvert to DataFrame:")
print(df)

📈 Performance Optimization Tips

Vectorized Operations

python
import time

# Create large Series
large_series = pd.Series(np.random.randn(1000000))

# Compare performance of loops vs vectorized operations
# Method 1: Loop (slow)
start_time = time.time()
result1 = pd.Series([x**2 if x > 0 else 0 for x in large_series])
loop_time = time.time() - start_time

# Method 2: Vectorized (fast)
start_time = time.time()
result2 = large_series.where(large_series > 0, 0) ** 2
vectorized_time = time.time() - start_time

print(f"Loop method time: {loop_time:.4f} seconds")
print(f"Vectorized method time: {vectorized_time:.4f} seconds")
print(f"Performance improvement: {loop_time/vectorized_time:.1f}x")

Memory Optimization

python
# Choose appropriate data types
# Default integer type
default_int = pd.Series([1, 2, 3, 4, 5])
print(f"Default int type: {default_int.dtype}, Memory: {default_int.memory_usage()} bytes")

# Optimized integer type
optimized_int = pd.Series([1, 2, 3, 4, 5], dtype='int8')
print(f"Optimized int type: {optimized_int.dtype}, Memory: {optimized_int.memory_usage()} bytes")

# Categorical data optimization
colors = pd.Series(['Red', 'Green', 'Blue'] * 1000)
print(f"String type memory: {colors.memory_usage(deep=True)} bytes")

colors_cat = colors.astype('category')
print(f"Category type memory: {colors_cat.memory_usage(deep=True)} bytes")
print(f"Memory savings: {(1 - colors_cat.memory_usage(deep=True)/colors.memory_usage(deep=True))*100:.1f}%")

📝 Chapter Summary

Through this chapter, you should have mastered:

  • Series Basic Concepts: Understanding the structure and characteristics of Series
  • Creating Series: Mastering various methods for creating Series
  • Indexing and Selection: Proficiently using various indexing methods
  • Data Operations: Performing mathematical operations and data processing
  • Statistical Analysis: Using statistical methods to analyze data
  • Practical Applications: Solving real data analysis problems
  • Performance Optimization: Improving code execution efficiency

Key Points

  1. Series is the Foundation of Pandas: Understanding Series is crucial for learning DataFrame
  2. Importance of Indexing: Proper use of indexing can greatly improve data processing efficiency
  3. Vectorized Operations: Avoid loops, use Pandas built-in methods
  4. Data Type Optimization: Choosing appropriate data types can save memory

Next Steps

Now that you've mastered Series, next we'll learn about Pandas' other core data structure: DataFrame.


Next Chapter: Pandas DataFrame Data Structure